R PDF
UNIT-I - Familiarizing with R environment, Using R console as a calculator, R atomic types, meth-
ods of creating vectors, combining vectors and repeating vectors, different ways of subsetting vectors
using indexing, names and logicals. Arithmetic and logical operations. Using character vectors for text
data, manipulating text using strsplit(), paste(), cat(), grep(), gsub() functions; handling factor data;
working with dates.
UNIT - II - Creating Matrices, getting values in and out of matrices, performing matrix calcula-
tions; Working with multidimensional Arrays; creating data frames, getting values in and out of data
frames, adding rows to data frame, adding variables to data frame; creating lists, extracting components
from a list, changing values of components of lists. Getting data into and out of R - reading data in
CSV files, EXCEL files, SPSS files and working with other data types. Getting data out of R - working
with write.csv() and write.table() functions.
UNIT - III - Writing scripts and functions in R. Writing functions with named, default and optional
arguments. Using functions as arguments. Debugging your code. Control statements in R - conditional
control using if, if-else, ifelse; looping control using for, while, repeat; transfer of control using break
and next. Manipulating and processing data - creating subsets of data, use of merge() function, sorting
and ordering of data. Group manipulation using apply family of functions - apply, sapply, lapply, tapply.
UNIT - IV - Base graphics. Use of high-level plotting functions for creating histograms, scatter
plots, box-whiskers plot, bar plot, dot plot, Q-Q plot and curves. Controlling plot options using low-
level plotting functions - Adding lines, segments, points, polygon, grid to the plotting region; Add text
using legend, text, mtext; and Modify/add axes, Putting multiple plots on a single page.
UNIT - V - Working with probability distributions - normal, binomial, Poisson and other distribu-
tions. Summary statistics, hypothesis testing - one and two-sample Student’s t-tests, Wilcoxon U-test,
paired t-test, paired U-test, correlation and covariance, correlation tests, tests for association- Chi-
squared test and goodness-of-fit tests. Formula notation, one-way and two-way ANOVA and post-hoc
testing, graphical summary of ANOVA and post-hoc testing, extracting means and summary statistics;
Simple linear regression
Text Books:
1. Mark Gardener(2012), Beginning R - The Statistical Programming Language, Wiley India Pvt
Ltd.
2. Andrie de Vries and Joris Meys(2015), R Programming for Dummies, Wiley India Pvt Ltd.
3. Jared P. Lander(2014), R For Everyone - Advanced Analytics and Graphics, Pearson Education
Inc.
R - Vectors-1
R - A Quick Start
Types of data
The basic data types in R are called atomic types. They are:
numeric,
integer,
character,
logical,
complex, and
raw.
Operators
Operator   Purpose
+          addition
-          subtraction
/          division
*          multiplication
^          exponentiation
%%         modulus
&          logical AND
|          logical OR
> 2 + 3
[1] 5
> 2 -3
[1] -1
>
> 2 * 3
[1] 6
>
> 2 / 3
[1] 0.6666667
>
> 2^3
[1] 8
> 2%%3
[1] 2
>
September 6, 2020
Objects:
Variables in R are called objects. The rules for naming objects are much the same as
in the C language, except that a name cannot begin with an underscore. You can also use
the period symbol to create multi-word object names. For example, x.bar is a valid object
name in R, which you might use to refer to the average of a number of values. Another
character you can use in multi-word object names is the underscore symbol (for example, x_bar).
> x <- 2
> y <- 3
>
> x + y
[1] 5
> x - y
[1] -1
>
> x * y
[1] 6
> x / y
[1] 0.6666667
>
> x^y
[1] 8
>
For example, a vector, named x, is created using the c() function as follows:
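The command itself did not survive in this copy of the notes. Judging from the description that follows, it was presumably the call below (a reconstruction, not the original listing):

```r
# Assumed reconstruction: combine four numbers into a numeric vector named x
x <- c(5, 2, 6, 10)
```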
The above command creates a numeric vector consisting of the values 5, 2, 6, and 10, in that
order, and stores it in the object x. In the second line you see the command prompt again. It
means that the command was executed without error and R is waiting for your next command.
> x
[1] 5 2 6 10
>
> mode(x)
[1] "numeric"
>
If you simply type the vector name and press Enter, the contents of the vector will be
displayed, as in the above example. The mode() function returns the mode (basic type) of an
object. In the case of the object x above, it is numeric.
Create a character vector, for example,
The object char is a character vector. Similarly, you can create a logical vector, as in the
following example:
A vector is meant for storing values of the same type and for operating on them. What happens
if we put values of different types into the same vector?
Observe that numeric values, when mixed with character values, are converted into
character type.
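The commands for these three examples are missing from this copy of the notes. A minimal sketch of what they could look like follows; the values stored in char and the object names logi and num.char are assumptions made for illustration.

```r
# A character vector (the stored values are illustrative)
char <- c("a", "b", "c")
mode(char)        # "character"

# A logical vector (the name logi is hypothetical)
logi <- c(TRUE, FALSE, TRUE)
mode(logi)        # "logical"

# Mixing numeric and character values: the numbers become strings
num.char <- c(1, 2, "a", "b")
num.char          # "1" "2" "a" "b"
mode(num.char)    # "character"
```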
Logical values get converted into characters if they occur together with character values
in a vector.
> num.logi <- c(1,2,TRUE,FALSE)
> num.logi
[1] 1 2 1 0
>
> mode(num.logi)
[1] "numeric"
>
Logical values are converted into numeric type, if they appear together with numeric
values.
This is called coercion. The R system automatically converts the lower-type data in a
vector to the highest type present, in the order logical, integer, numeric, complex, character.
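As a quick check of the coercion behaviour described here, the sketch below mixes types in a few ways. The object names logi.char and all.three are hypothetical.

```r
# logical + character -> character
logi.char <- c(TRUE, "a", FALSE)
logi.char            # "TRUE" "a" "FALSE"

# logical + numeric -> numeric (TRUE becomes 1, FALSE becomes 0)
num.logi <- c(1, 2, TRUE, FALSE)
num.logi             # 1 2 1 0

# all three together -> character, the highest type present
all.three <- c(1, TRUE, "x")
all.three            # "1" "TRUE" "x"
```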
Exercise: (a) Create a vector called mid.marks with the values 18, 20, 12, 15.
(b) Create a vector called grades whose members are "A", "O",
"A+", "A", "B".
(c) Create a vector called results whose members are 5 >= 2,
5 > 2, 5 < 2, 5 <= 2, 5 == 2.
Now, let us inspect the contents of the object x. In R, indexing of the elements of a vector
starts at 1. You can fetch a member of a vector by suffixing the object name with a pair
of square brackets [] enclosing an integer index. For example, x[2] means the
second member of the vector x, and x[3] the third:
> x[3]
[1] 6
Whenever you see [1] at the start of the R system's response, it means that the
result of your command is a vector and that the first value shown on that line has index 1. Here, the result
of the command x[3] is a vector (because you see the [1]) of size one, since 6 is the only value
displayed.
You can fetch two or more values using a command of the type
> vec1[vec2]
That is, first create a vector, say vec2, whose elements are the indices of the members
of the vector vec1 that you want to pull out. Then pass vec2 as the index of the vector vec1.
To get the first two members of our earlier vector x, use the command:
> x[c(1,2)]
[1] 5 2
So, the result of the command x[c(1,2)] is a vector whose first member is 5 and the second
member is 2.
Similarly, x[c(4,2,3)] results in the vector 10, 2, and 6.
The : operator
The : operator creates a sequence of whole numbers differing by one. The general syntax is
> start_value : end_value
> x[-2]
[1] 5 6 10
>
> x
[1] 5 2 6 10
Remember that the command x[-2] does not remove the element from x. To delete the
second element from x, you have to assign x[-2] to x:
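The assignment itself is missing from this copy of the notes. A sketch of it, using the vector x shown above:

```r
x <- c(5, 2, 6, 10)   # the vector used so far
x <- x[-2]            # overwrite x with a copy that lacks the second element
x                     # 5 6 10
```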
The length of x is now 3. Unlike an array in C, a vector in R can grow or
shrink.
To add an element, say 100, at the beginning of the vector x, combine it with x using c().
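The command for this step is missing from the notes. One way to do it, assuming x currently holds 5, 6 and 10, is to combine the new value and the old vector with c():

```r
x <- c(5, 6, 10)      # assumed current contents of x
x <- c(100, x)        # prepend: 100 becomes the first element
x                     # 100 5 6 10
length(x)             # 4
```

Placing the new value after x instead, as in c(x, 100), appends it at the end.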
> y[-c(1:2)]
Vectorization:
>
> x <- 1:5
> x
[1] 1 2 3 4 5
>
> x^2
[1] 1 4 9 16 25
>
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
>
> sqrt(x^2)
[1] 1 2 3 4 5
>
>
Notice that the command x^2 computes the square of each element of the vector x.
Similarly, the command sqrt(x) computes the square root of every member of the vector x.
These are called vectorized operations. To achieve this in a procedure-oriented programming
language such as C, one has to use a "for loop". In R, vectorization of operations avoids
the use of "for loops".
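To make the comparison concrete, the sketch below computes the squares twice, once with the vectorized expression and once with an explicit loop written in a C-like style. The names sq.vec and sq.loop are hypothetical.

```r
x <- 1:5

# Vectorized: one expression squares every element
sq.vec <- x^2

# Loop equivalent: fill a result vector one element at a time
sq.loop <- numeric(length(x))
for (i in seq_along(x)) {
  sq.loop[i] <- x[i]^2
}

identical(sq.vec, sq.loop)   # TRUE
```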
Recycling:
Recycling refers to the way a shorter vector is repeated to match the length of the
longer vector when an arithmetic operation is performed on two vectors. We illustrate
this with reference to the addition operation. For example,
(i) If both vectors are of equal length, they are added element by element.
(ii) If one of the vectors involved in an addition has a length twice that of the
other, the shorter vector gets recycled until its length equals the length of the longer
vector, and then the addition is performed on the resulting vectors.
(iii) Suppose the two vectors involved in an addition are unequal in length
and the length of the longer vector is not an integer multiple of the length of the shorter one.
Then the shorter vector still gets recycled to match the length of the longer vector and
the addition is performed. In this case, we also get a warning about the
mismatch in the lengths of the vectors involved in the operation.
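The three cases can be sketched as follows. The example listings are missing from this copy of the notes, so the vectors and the names r1, r2 and r3 are illustrative.

```r
# (i) equal lengths: element-by-element addition
r1 <- c(1, 2, 3) + c(10, 20, 30)      # 11 22 33

# (ii) longer length is a multiple of the shorter: silent recycling
r2 <- c(1, 2, 3, 4) + c(10, 20)       # 11 22 13 24

# (iii) longer length is not a multiple: recycling plus a warning
r3 <- c(1, 2, 3) + c(10, 20)          # 11 22 13, with a warning
```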
Filtering a vector for a subset of its elements is done with a command of the general form
vector1[vector2]
where vector1 is the vector to be filtered and vector2 is a vector of indices or logical
values or member names.
> x<- c(5,16,18,8,1,11,4,10,15)
> x
[1] 5 16 18 8 1 11 4 10 15
>
Suppose we want to filter the vector for 4th, 5th and 1st elements in that order. Then,
with reference to the general syntax for subsetting a vector, vector2 is then c(4,5,1) and
vector1 is x. Thus, x[c(4,5,1)] is the command to be used to fetch the required members
from the vector x.
> x[c(4,5,1)]
[1] 8 1 5
>
> x
[1] 5 16 18 8 1 11 4 10 15
>
># omitting the 3rd element
> x[-3]
[1] 5 16 8 1 11 4 10 15
>
> # leaving the last element
> x[-length(x)]
[1] 5 16 18 8 1 11 4 10
>
> # Getting all the elements except the first three
> x[-c(1:3)]
[1] 8 1 11 4 10 15
>
y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] or
y[c(TRUE,FALSE)]
If a logical vector is used to subset a vector, its size should normally be
the same as the size of the vector to be filtered. If it is shorter, it is recycled:
the command y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] is equivalent to y[c(TRUE,FALSE)]
because the indexing vector gets recycled until its size becomes the size of y.
In practice, the logical vector is created from a condition to be met by the members of the
vector. For example, suppose we are interested in the members of x whose values are larger than 6. The
condition here is x > 6. If we write x > 6, it results in a vector of 9 (the length of x) logical
values. This is because the operation x > 6 involves two vectors of unequal length.
Because of the recycling property, the shorter one (6, being a vector of size 1) gets recycled
to match the size of the longer vector, so that every member of x is compared with 6.
In fact, the operators +, > etc., are all functions. As the operation involved here is
relational, its result is logical. Hence, we have
(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE).
The R system omits the members whose indexing value is FALSE. The logical vector
c(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)
selects the same members as the index vector c(2,3,4,6,8,9) and
is used as the indexing vector to filter x for all those values larger than 6.
> x
[1] 5 16 18 8 1 11 4 10 15
>
> x>6
[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
>
> x[x>6]
[1] 16 18 8 11 10 15
>
In practice, we can use a function called which() to get the indices of the members of a vector that
satisfy some condition. In the above illustration, which(x > 6) returns the indices at which
the values of x are larger than 6.
As a C programmer you know the basic types(integer, floating-point, character, logical) supported by the
C language. R too has some basic types and are usually referred to as atomic types. They are: numeric,
integer, character, logical, complex and raw. The numeric type is like double in C language.
R supports a number of data structures, namely, vectors, matrices, arrays, data frames and lists.
"<-" is the assignment operator. You can also use "=" as the assignment operator, but the preferred symbol
is "<-". It is a matter of personal taste.
> x <- 2
>
>x
[1] 2
[1] represents the index of the value 2 in vector named x. [] is the indexing operator.
The indexing of the members of vectors starts at 1 and not at 0 as in the C language.
[1] 2
>2
[1] 2
That is 2 is the first member of an unnamed vector. Vectors in R work like one-dimensional arrays in C
language. So, the members of a vector must be of the same atomic type or mode. That is, we can have
numeric vectors, character vectors, logical vectors, vectors of complex numbers.
> 2 -> x
>x
[1] 2
>
[1] FALSE
> is.numeric(2)
[1] TRUE
>
If you want to convert the number 2 into an integer, use the function as.integer().
> is.integer(x)
[1] TRUE
By suffixing a numeric constant with the letter L, you can write that number directly as an
integer instead of a numeric.
> is.integer(2L)
[1] TRUE
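Some of the commands in the transcript above were lost in this copy of the notes. A sketch of the full sequence of checks, under the assumption that the missing lines tested and converted the constant 2:

```r
is.integer(2)          # FALSE: a plain 2 is stored as numeric (double)
is.numeric(2)          # TRUE

x <- as.integer(2)     # explicit conversion to integer
is.integer(x)          # TRUE

is.integer(2L)         # TRUE: the L suffix writes an integer constant directly
```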
Similar functions exist for the other types: is.character(), is.logical(), is.complex(),
as.character(), as.logical() and as.complex().
The symbol # is used to make comments in R scripts. R ignores everything written after #.
R as a Calculator:
------------------
> 2 + 3
[1] 5
>
> 2 - 3
[1] -1
>
> 2 * 3
[1] 6
>
> 2 / 3 # / is for division
[1] 0.6666667
>
> 2 ^ 3
[1] 8
>
> 2 %% 3
[1] 2
>
[1] 2
>
Mostly, we store data in variables and use those variables in our computations. The first character
of a variable name must be a letter and the rest of the characters can be letters or digits.
We can also use the period (.) and the underscore (_) for creating multi-word
variable names; for example, x.mean (or x_mean) could be a variable name representing the average
value of the variable x.
> x <- 2
> y <- 3
>
> x + y
[1] 5
> x - y
[1] -1
>
> x * y
[1] 6
> x / y
[1] 0.6666667
[1] 0.6666667
>
> x^y
[1] 8
>
------------------------------
------------------
While subsetting the various data structures, we frequently need to create sequences of integers. The
colon (:) operator can be used to create vectors consisting of sequences of integers. The left operand of
the colon represents the starting integer and the right operand represents the ending integer of the
sequence. If the left operand is smaller than the right operand, the result is an increasing sequence
of integers whose successive members differ by one. If the left operand is larger than the right operand,
the result is a decreasing sequence of integers whose successive members differ by one.
> 1:5
[1] 1 2 3 4 5
>
> 5:1
[1] 5 4 3 2 1
>
>
> 1.2:5
>
> 5.05:3
>
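The outputs of the last two commands are missing from this copy of the notes. With a non-integer left operand, the sequence starts at that value and moves in steps of one, stopping before it would pass the right operand. A sketch (the names up and down are hypothetical):

```r
up   <- 1.2:5      # 1.2 2.2 3.2 4.2   (5.2 would overshoot 5)
down <- 5.05:3     # 5.05 4.05 3.05    (2.05 would undershoot 3)
```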
-------------------
The seq() function will be useful to create vectors of specific pattern. The following are different forms
of seq() function using which we can create vectors:
(i) seq(from,to)
For example seq(from=1, to=5) or simply, seq(1,5) will create a vector of values starting from 1 to 5
incremented by 1. That is, it creates the vector (1,2,3,4,5). Here the start value is smaller than the end
value. If the start value is larger than the end value, it creates a vector of values in descending order. For
example, seq(5,1) will return the vector (5,4,3,2,1).
> seq( 1, 5 )
[1] 1 2 3 4 5
> seq( 5, 1 )
[1] 5 4 3 2 1
>
> seq(1.1,6)
>
> seq(5.2,1.2)
>
> seq(5.2,0)
>
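The outputs of the three calls with non-integer arguments did not survive in this copy. The sketch below repeats them; with only from and to given, seq() steps by one from the start value and stops before passing the end value. The names s1, s2 and s3 are hypothetical.

```r
s1 <- seq(1.1, 6)     # 1.1 2.1 3.1 4.1 5.1
s2 <- seq(5.2, 1.2)   # 5.2 4.2 3.2 2.2 1.2
s3 <- seq(5.2, 0)     # 5.2 4.2 3.2 2.2 1.2 0.2
```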
Note that the first argument to the function seq() is assigned to the named parameter from and the
second argument is assigned to the named parameter to. Both from and to
have a default value of 1. That is, from=1, to=1.
If our objective is not a sequence of integer values, we can specify the increment value using the named
parameter by. The general form of seq() function with three parameters is:
(ii) seq(from,to,by)
Note that by also has a default value, which is obtained from the expression
(to-from)/(length.out-1), where length.out is the size of the vector.
> seq( 1, 5, 2 )
[1] 1 3 5
On some occasions, we want a vector of a specific length whose start value and end value are
known. On such occasions, use the following form of the seq() function:
(iii) seq(from,to,length.out)
Suppose we want a vector of 12 values that starts with 1 and ends with 2.
> seq(1,2,len=12)
>
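The output of this call is missing from the notes. A sketch that checks its main properties, using the full argument name length.out (len= works through partial matching of argument names); the name v is hypothetical:

```r
v <- seq(1, 2, length.out = 12)
length(v)      # 12
range(v)       # 1 2
diff(v)        # a constant step of (2 - 1)/(12 - 1) = 1/11
```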
Consider the function call: seq(5). It generates the sequence of integers 1,2,3,4,5. This is because it
defaults to the function call seq(1,len=5). That is, the function calls seq(from=5) and seq(length.out=5)
are equivalent. The call seq(along.with=v) generates the indices 1,2,...,length(v) of a vector v, so it gives
the same result only when v has five members; seq(along.with=5) returns just 1. So, the seq() function also
supports the function prototypes:
(iv) seq(from)
(v) seq(along.with=)
(vi) seq(length.out=)
-------------------
Use the rep() function to create a vector of a specified length by repeating values.
rep(x, freq.of.x)
where freq.of.x is either a single number, in which case the whole of x is repeated that many times, or a
vector of the same length as x specifying the number of times each member of x must be repeated. For example,
rep(4,3) returns the vector (4,4,4).
> rep(4,3)
[1] 4 4 4
> rep(c(1,3),c(3,2))
[1] 1 1 1 3 3
> rep(1:3,2)
[1] 1 2 3 1 2 3
>
> rep(c("S","F"),c(2,3))
>
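The different ways of calling rep() can be sketched side by side. The each= argument in the last line is not used elsewhere in these notes and is added only for comparison.

```r
rep(4, 3)                    # 4 4 4
rep(c(1, 3), c(3, 2))        # 1 1 1 3 3    (each member repeated its own count)
rep(1:3, 2)                  # 1 2 3 1 2 3  (whole vector repeated twice)
rep(c("S", "F"), c(2, 3))    # "S" "S" "F" "F" "F"
rep(1:3, each = 2)           # 1 1 2 2 3 3
```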
----------------
The most frequently used function to create a vector is c(). The name of the function c is usually
understood to represent combine or concatenate. The general form of creating a vector using the c()
function is:
Examples:
(b) c('a','b','c')(or c("a","b","c")) creates a character vector whose members are "a", "b", "c".
(e) The c() function can be used to combine two or more vectors into a single vector.
For example,
will create a vector with named members. These names will be useful in filtering the vector.
>
>x
20 18 19
>
It is possible to give names to an already created vector using the function names().
For example, x=c(1,2,3). Then
> names(x)
>
>x
1 2 3
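The commands and the member names were lost from this part of the notes (only the values 20 18 19 and 1 2 3 survive). A sketch of both ways of naming, with hypothetical names and a hypothetical object marks:

```r
# Names given at creation time (the names phy, math, chem are illustrative)
marks <- c(phy = 20, math = 18, chem = 19)
marks

# Names given to an existing vector with names()
x <- c(1, 2, 3)
names(x)                        # NULL: no names yet
names(x) <- c("a", "b", "c")    # hypothetical names
x["b"]                          # the member named b
```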
Subsetting Vectors
------------------
Filtering a vector for a subset of its elements can be done in one of the following methods:
-----------------------------
Indexing of members of a vector starts at 1. In general, subsetting of a vector takes the form:
vector1[vector2]
where vector1 is the vector to be filtered and vector2 is a vector of indices or logicals or member names.
>x
[1] 5 16 18 8 1 11 4 10 15
>
Suppose we want to filter the vector for 4th, 5th and 1st elements in that order. Then, with reference to
the general syntax for subsetting a vector, vector2 is then c(4,5,1) and vector1 is x. Thus, x[c(4,5,1)] is
the command to be used to fetch the required members from the vector x.
> x[c(4,5,1)]
[1] 8 1 5
>
------------------------------
The R system also allows us to use negative indices. Negative indices allow us to omit one or more elements
of a vector.
For example, x[-1] displays all the elements of x except the first element; x[-c(2,5)] displays all the
elements of x except 2nd and 5th elements.
>x
[1] 5 16 18 8 1 11 4 10 15
>
> x[-3]
[1] 5 16 8 1 11 4 10 15
>
> x[-length(x)]
[1] 5 16 18 8 1 11 4 10
>
> x[-c(1:3)]
[1] 8 1 11 4 10 15
>
Filtering logicals
------------------
A vector of logical values can also be used as an indexing vector. For example, if y = c(1,2,3,4,5), then to
fetch alternate values starting at index 1, use the command
y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] or
y[c(TRUE,FALSE)]
> y <- 1:5
>y
[1] 1 2 3 4 5
> y[c(TRUE,FALSE,TRUE,FALSE,TRUE)]
[1] 1 3 5
>
> y[c(TRUE,FALSE)]
[1] 1 3 5
>
If a logical vector is used to subset a vector, its size should normally be the same as the
size of the vector to be filtered; a shorter logical vector is recycled, as in the second command above.
In practice, the logical vector is created from a condition to be met by the members of the vector. For example,
suppose we are interested in the members of x whose values are larger than 6. The condition here is x>6.
If we write x>6, it results in a vector of 9 (the length of x) logical values. This is because the operation x>6
involves two vectors of unequal length. Because of the recycling property, the shorter one (6, being a
vector of size 1) gets recycled to match the size of the longer vector. Now we have
(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE).
The R system omits the members whose indexing value is FALSE. The logical vector
c(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)
selects the same members as the index vector c(2,3,4,6,8,9) and
is used as the indexing vector to filter x for all those values larger than 6.
>x
[1] 5 16 18 8 1 11 4 10 15
>
> x>6
[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
>
> x[x>6]
[1] 16 18 8 11 10 15
>
In practice, we use a function called which() to get the indices of a vector satisfying some condition. In
the above illustration, which(x>6) will return the indices of the vector at which the values of x are larger
than 6.
> which(x>6) # returns indices
[1] 2 3 4 6 8 9
>
> x[which(x>6)]
[1] 16 18 8 11 10 15
>
> x[x>6]
[1] 16 18 8 11 10 15
>
Filtering by names
------------------
The c() function can be used to create a vector having its members given some names(named vectors).
For example,
> n = c(phy=12,math=13,chem=20)
> n["phy"]
phy
12
> n[c("phy","chem")]
phy chem
12 20
>
> n[c(TRUE,FALSE,TRUE)]
phy chem
12 20
>
Note that the member names must be enclosed in quotes in the indexing vector.
------------------------------------------------------------------------------
Operators:
operator function
-------- --------
+ addition
- subtraction
* multiplication
/ division
^ exponentiation
%% modulus
-------- ------------------
= assignment operator
-------- -------------------
== equality comparison
!= not equal to
-------- -------------------
Functions:
(i) c()
(ii) seq()
(iii) rep()
(i) is.numeric()
(ii) is.integer()
(iii) is.character()
(iv) is.logical()
(i) as.numeric()
(ii) as.integer()
(iii) as.character()
(iv) as.logical()
(d) which() --- returns the integer indices at which a logical vector is TRUE.
(e) names() --- when invoked on a vector, it returns the names of the members(if the members are
named) otherwise returns NULL
R - The sample() function
The R system provides a function called sample() with which you can draw random samples with
replacement, without replacement, or with varying probabilities. The first parameter of
this function is the data vector from which to sample. The second parameter is the sample
size. To get a sample with replacement, use the replace = TRUE option. To get a sample without
replacement, use replace = FALSE. The default value of the replace = option is FALSE.
So, if you use only the first two parameters, you get a sample without replacement, provided
the sample size is not greater than the length of the data vector. This function can also be used
to sample different values with different probabilities. To achieve this we have to use the
prob = option. It is given a vector of probabilities whose length is the same as
that of the data vector. To illustrate the usage of the sample() function, suppose the data
vector consists of 0 and 1. These values may correspond to, say, tail and head respectively
in a coin toss experiment. So, a command that draws a sample of size 5 with replacement from this
vector will result in a sequence of zeros and ones and will represent a realization of five
coin tosses.
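The command for the coin toss is missing from this copy of the notes. A minimal sketch of it, together with a biased coin and a draw without replacement; the object names and the probabilities 0.2 and 0.8 are assumptions chosen for illustration:

```r
set.seed(1)    # only so that the sketch is reproducible

# Five tosses of a fair coin: 0 = tail, 1 = head
tosses <- sample(c(0, 1), 5, replace = TRUE)

# A biased coin: head with probability 0.8 (prob = sets the probabilities)
biased <- sample(c(0, 1), 5, replace = TRUE, prob = c(0.2, 0.8))

# Without replacement the sample cannot be larger than the data vector
shuffled <- sample(1:10, 10)    # a random permutation of 1 to 10
```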
Systematic Sampling
Step 1: Create a vector of names of your classmates in the order of their names as in the
attendance register. See that the name must be a single word. Call this vector as mcs.stu.
Step 2: Use the sample() function to create a vector called marks. The data to the
sample() function must have values in the range 12 to 20. The length of the marks vector
must equal the length of the vector created in Step 1.
Step 3: Find the average of the vector marks. This is the mean of the population.
Step 4: Use names() function to assign the names in mcs.stu vector to the members of
marks vector created in Step 2.
Step 5: Use the sample() function to generate random start value between 1 and 5, where
5 is the Sampling Interval. Name this object as random.start
Step 6: Use the seq() function to generate the systematic sample labels. Name this vector
as sys.labels.
Step 7: Use the sys.labels vector as the indexing vector in marks vector to get a systematic
sample. Name the resulting vector as sys.sample.
Step 8: Invoke the mean() on sys.sample to get the systematic sample mean.
Step 9: Compare the population mean with the sample mean.
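The steps above can be sketched as follows. The student names are made-up placeholders for the attendance register, a class of 20 is assumed, and pop.mean and sys.mean are hypothetical names for the two means.

```r
set.seed(42)                                  # for a reproducible sketch

# Step 1: hypothetical single-word names standing in for the register
mcs.stu <- paste0("Student", 1:20)

# Step 2: marks between 12 and 20, one per student
marks <- sample(12:20, length(mcs.stu), replace = TRUE)

# Step 3: population mean
pop.mean <- mean(marks)

# Step 4: attach the names to the marks
names(marks) <- mcs.stu

# Step 5: random start between 1 and the sampling interval 5
random.start <- sample(1:5, 1)

# Step 6: labels random.start, random.start + 5, ...
sys.labels <- seq(random.start, length(marks), by = 5)

# Steps 7 and 8: the systematic sample and its mean
sys.sample <- marks[sys.labels]
sys.mean <- mean(sys.sample)

# Step 9: compare the two means
c(population = pop.mean, sample = sys.mean)
```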
September 22, 2020
Manipulating Text Data
L. V. Rao
strsplit(x, split)
where
[[2]]
[1] "Empty" "vessels" "make" "much" "noise"
>
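The command that produced the output above is missing from the notes. A sketch of how strsplit() behaves on a vector of sentences; the second sentence is the one visible in the output, while the first is a made-up stand-in for the one that was lost:

```r
x <- c("A stitch in time saves nine", "Empty vessels make much noise")

words <- strsplit(x, split = " ")   # a list with one character vector per sentence
words[[2]]                          # "Empty" "vessels" "make" "much" "noise"
lengths(words)                      # number of words in each sentence
```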
> nchar("x")
[1] 1
> nchar("xy")
[1] 2
> nchar("xy ")
[1] 3
grep() function
Is there any car with the name Mazda included in the mtcars
dataset?
> rownames(mtcars)[1:4]
[1] "Mazda RX4" "Mazda RX4 Wag"
[3] "Datsun 710" "Hornet 4 Drive"
>
> noquote(rownames(mtcars))[1:4]
[1] Mazda RX4 Mazda RX4 Wag
[3] Datsun 710 Hornet 4 Drive
>
> print(rownames(mtcars)[1:4], quote = FALSE)
[1] Mazda RX4 Mazda RX4 Wag
[3] Datsun 710 Hornet 4 Drive
usage:
substr(x, start, stop)
where
x is a character vector,
start indicates the first character
to be extracted (or replaced), and
stop indicates the last character
to be extracted (or replaced):
> substr("Programming",4,7)
[1] "gram"
>
-------------------------------------------------------
-----------------------------------------------------------
3. How many car names have three words in their name? What are they?
--------------------------------------------------------------
4. How many one word, two word and three word car names are there?
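The answers to questions 3 and 4 are not included in this copy of the notes. One possible approach, assuming cnames holds the row names of mtcars (as in the later answers) and using a hypothetical helper vector n.words:

```r
cnames <- rownames(mtcars)            # car names

# Split each name at spaces and count the pieces
n.words <- lengths(strsplit(cnames, " "))

# Question 3: names made of exactly three words
sum(n.words == 3)
cnames[n.words == 3]

# Question 4: how many names have one, two and three words
table(n.words)
```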
---------------------------------------------------------
-----------------------------------------------------------
--------------------------------------------------------------
------------------------------------------------------------
> grep("Merc",cnames)
[1] 8 9 10 11 12 13 14
> ind <- grep("Merc",cnames)
> mtcars[ind,"mpg"]
[1] 24.4 22.8 19.2 17.8 16.4 17.3 15.2
> range(mtcars[ind,"mpg"])
[1] 15.2 24.4
> diff(range(mtcars[ind,"mpg"]))
[1] 9.2
>
9. Which Merc car is having maximum mileage?
> grep("Merc",cnames)
[1] 8 9 10 11 12 13 14
> ind <- grep("Merc",cnames)
> mtcars[ind,"mpg"]
[1] 24.4 22.8 19.2 17.8 16.4 17.3 15.2
>
> cnames[ind[which(mtcars[ind,"mpg"]==max(mtcars[ind,"mpg"]))]]
[1] "Merc 240D"
>
---------------------------------------------------------------
substring()
Extraction Function
substring(text, first, last = 1000000L)
Replacement Function
> Sys.time()
[1] "2020-10-20 20:25:13 IST"
> date()
[1] "Tue Oct 20 20:25:25 2020"
toupper(x)
tolower(x)
> x
[1] 1 2 2 1 1 1 0 0 0 0 0
> y
[1] 1 1 2 1 1 0 1 1 0 0 0
> z
[1] 0 1 0 1 1 0 0 0 1 1 1
>
October 29, 2020
Computing Cosine Similarity
> sim.xy
[1] 0.8215838
> sim.xz
[1] 0.4714045
> sim.yz
[1] 0.3872983
>
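The commands that computed these similarities are missing from the notes. A sketch that reproduces the three values, using a hypothetical helper cos.sim() that divides the dot product of two vectors by the product of their lengths (norms):

```r
x <- c(1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0)
y <- c(1, 1, 2, 1, 1, 0, 1, 1, 0, 0, 0)
z <- c(0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1)

# cos.sim is a hypothetical helper, not a base R function
cos.sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim.xy <- cos.sim(x, y)   # 0.8215838
sim.xz <- cos.sim(x, z)   # 0.4714045
sim.yz <- cos.sim(y, z)   # 0.3872983
```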
The scan() command can be used to create data objects with the
data from other programs such as spreadsheets or notepad.
1. If the data are numbers in a spreadsheet, simply type the
command in R as usual before switching to the spreadsheet
containing the data.
2. Highlight the necessary cells in the spreadsheet and copy
them to the clipboard.
3. Return to R and paste the data from the clipboard into R.
As usual, R waits until a blank line is entered before ending
the data entry so you can continue to copy and paste more
data as required.
4. Once you are finished, enter a blank line to complete data
entry.
If the data are separated with simple spaces, you can simply
copy and paste.
If the data are separated with some other character, you need
to tell R which character is used as the separator.
For example, a CSV (comma-separated values), uses
commas to separate the data items. To tell R you are using
this separator, simply add an extra part to your command like
so:
obj.name <- scan( sep = ',' )
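Because the clipboard workflow cannot be shown in a script, the sketch below feeds scan() a string through its text= argument instead. The data values are made up; the sep = "," part works exactly as in the command above.

```r
# Same call, but reading from a string so it runs without the clipboard
obj.name <- scan(text = "12,15,9,20", sep = ",")
obj.name    # 12 15 9 20
```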
> with(df,table(Residence,Score>65))
> with(mtcars,table(am,cyl))
cyl
am 4 6 8
0 3 4 12
1 8 3 2
> with(mtcars,tapply(mpg,list(am,cyl),mean))
4 6 8
0 22.900 19.12500 15.05
1 28.075 20.56667 15.40
>
A matrix is a two-dimensional R object, which can hold data of only one type (integer,
numeric, character, and so on). The dimensions are called rows and columns.
There are at least three ways of creating a matrix:
matrix() function
The basic function used to create a matrix is the matrix() function. It is normally given at least two
arguments: the first is the data (usually a vector) out of which the matrix is to be
created, and the second is either the number of rows or the number of columns of the matrix.
By default, the matrix() function fills the elements of the matrix column-wise.
The syntax for the matrix() function is:
matrix() Function
matrix(x, nrow, ncol, byrow = FALSE, dimnames = NULL)
where
x is a vector
nrow the number of rows of the matrix
ncol the number of columns of the matrix
byrow logical. If TRUE, elements are filled row-wise
dimnames specifies the names of the rows and columns of the matrix
To illustrate creating a matrix using the matrix() function, first let us create a vector
consisting of the values from 1 to 12. Then, use this vector as the data of our matrix.
> x <- 1:12
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
>
> mat.x <- matrix(x, nrow = 3)
>
> mat.x
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
>
Note that, by default, the matrix() function fills the entries of the matrix column-wise.
Also, we used only the nrow= option of the matrix() function. From the length of the vector
and the size of the row dimension, it determines the number of columns. Let us now try the
ncol= option instead with the same data.
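The listing for this step is missing from the notes. A sketch of the comparison, with hypothetical object names mat.nrow and mat.ncol:

```r
x <- 1:12
mat.nrow <- matrix(x, nrow = 3)   # as above: columns worked out as 12 / 3 = 4
mat.ncol <- matrix(x, ncol = 4)   # rows worked out as 12 / 4 = 3
identical(mat.nrow, mat.ncol)     # TRUE
```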
September 16, 2020
R - Matrices
From the above output, we observe that both function calls return the same matrix.
This is because the matrix() function fills the entries of the matrix column-wise. To force
the matrix() function to fill the entries of the matrix by row, set the byrow= option to TRUE.
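A sketch of the row-wise fill, since the original listing is missing here; the name mat.byrow is hypothetical:

```r
x <- 1:12
mat.byrow <- matrix(x, nrow = 3, byrow = TRUE)
mat.byrow
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8
# [3,]    9   10   11   12
dim(mat.byrow)    # 3 4
```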
> dim(mat.x2)
[1] 3 4
The dim() function when invoked on a matrix returns a vector consisting of two values:
the first member represents the number of rows and the second, the number of columns.
You can give names to rows and columns of a matrix in a couple of ways: using dimnames=
option of the matrix function or using the rownames() and colnames() functions. The
dimnames= option is used to name the rows and columns of the matrix at the time of the
creation of the matrix. The dimnames= option expects a two-element list as its value( A list
is a data structure in R), whose first member is a vector consisting of the names of the rows
and the second member is also a vector containing the names of the columns.
Naming Rows and Columns using dimnames Option
> mat.x3 <- matrix( x, nrow = 3, byrow = TRUE,
dimnames = list( c("R1","R2","R3"),
c("C1","C2","C3","C4")))
> mat.x3
C1 C2 C3 C4
R1 1 2 3 4
R2 5 6 7 8
R3 9 10 11 12
The rownames() and colnames() functions are used to set the names of rows and
columns of an already created matrix. To illustrate the use of these functions, let us create
a matrix, called mat.y, as given below:
To set the names to the rows of the matrix mat.y, use the rownames() as in the following
example:
The rownames() function can also be used to get the row names of a matrix:
> rownames(mat.y)
[1] "R-1" "R-2" "R-3"
>
To set the names to the columns of the matrix mat.y, use the colnames() as in the
following example:
The colnames() function can also be used to get the column names of a matrix:
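The listings that create mat.y and set its names are missing from this copy. In the sketch below the contents of mat.y and the column names are assumptions; only the row names "R-1", "R-2", "R-3" appear in the notes.

```r
# mat.y is rebuilt here with assumed contents
mat.y <- matrix(1:12, nrow = 3)

rownames(mat.y) <- c("R-1", "R-2", "R-3")   # set the row names
colnames(mat.y) <- paste0("C-", 1:4)        # set (assumed) column names

rownames(mat.y)                             # get the row names
colnames(mat.y)                             # get the column names
```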
rbind() function
A matrix can be created out of several vectors of the same type and size by binding them
either row-wise using the rbind() function or column-wise using the cbind() function.
Let us create three vectors, say:
Let us now use the rbind() function to create a matrix out of the data already existing
in the form of the vectors x, y and z. Each vector becomes a row of the matrix created, and the
order of the vectors passed as parameters to the function determines the order of the rows
of the matrix. Also, the names of the vectors become the names of the rows of the matrix
created. The colnames() function can be used to set the column names of the matrix. The
rownames() function can be used to get the row names of the matrix as well as to change
the default row names of the matrix.
rbind() Function
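The listing is missing from this copy of the notes. A sketch with assumed values for x, y and z and a hypothetical result name mat.r:

```r
# Assumed vectors; the original definitions are missing
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)

mat.r <- rbind(x, y, z)                  # each vector becomes a row named after it
colnames(mat.r) <- c("C1", "C2", "C3")   # set the column names
rownames(mat.r)                          # "x" "y" "z"
mat.r
```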
Remember that all the vectors used with the rbind() function must be of the same type
and size. If they are of different sizes, a matrix will still be created, but it may not be the
desired matrix, and the R system will output a warning message to remind us about the
differences in sizes.
Use of rbind() Function with Vectors of Different Sizes
> x <- c(1:3) # length = 3
> y <- c(4:6) # length = 3
> z <- c(7,8) # length = 2
>
> rbind(x,y,z)
[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 7
Warning message:
In rbind(x, y, z) :
number of columns of result is not a
multiple of vector length (arg 3)
Note that recycling takes place to complete the last row: the first element of z (7) is reused as its third element.
cbind() function
The cbind() function can also be used to create a matrix using vectors of the same type and
size. However, the vectors used become the columns of the matrix and their names become
the names of the columns. The colnames() and rownames() functions can be used to modify
the names of the columns and rows respectively. The cbind() and rbind() functions can
also be used to add new columns and rows respectively to an existing matrix.
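As a rough illustration of the paragraph above, the sketch below builds a matrix with cbind() and then extends it. The vectors a and b and the names m, m2 and m3 are hypothetical.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

# The vectors become columns, and their names become the column names
m <- cbind(a, b)
colnames(m)                       # "a" "b"

# Add a new, named column to the existing matrix
m2 <- cbind(m, total = a + b)

# Add a new row to the existing matrix
m3 <- rbind(m2, c(10, 20, 30))
m3
```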
cbind() Function
The functions rownames() and colnames() can still be used to set as well as get the
names of the rows and columns of the matrices created using the cbind() function. For example,
suppose you have a matrix teams with the rows team1, team2 and team3, and you want to add a vector team4
to the matrix teams as a new row either at the beginning or at the end. You can do that as:
Appending a New Row at the End
> team4 <- c("Jalaja","Sailaja","Vanaja")
>
> # Adding as a last row
>
> teams.1 <- rbind(teams,team4)
> teams.1
[,1] [,2] [,3]
team1 "Sujatha" "Lalitha" "Kavitha"
team2 "Somaiah" "Rajaiah" "Ramaiah"
team3 "John" "Paul" "Hogg"
team4 "Jalaja" "Sailaja" "Vanaja"
>
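Adding the new row at the beginning works the same way, with the order of the arguments reversed. The sketch below rebuilds teams from the rows displayed above; the name teams.2 is hypothetical.

```r
# Rebuild the teams matrix from the rows displayed above
team1 <- c("Sujatha", "Lalitha", "Kavitha")
team2 <- c("Somaiah", "Rajaiah", "Ramaiah")
team3 <- c("John", "Paul", "Hogg")
teams <- rbind(team1, team2, team3)

team4 <- c("Jalaja", "Sailaja", "Vanaja")

# Adding as the first row: pass the new vector before the matrix
teams.2 <- rbind(team4, teams)
teams.2
```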
Subsetting Matrices
Remember that a matrix is a two-dimensional object. To fetch an element of a matrix
object, you require the row index and column index of that element. These indices must be
separated by a comma inside the indexing operator []. Suppose we have the 3 x 3 matrix x whose columns are (1, 2, 3), (4, 5, 6) and (7, 8, 9):
> x[2,3]
[1] 8
>
> x[3,]
[1] 3 6 9
>
> x[,2]
[1] 4 5 6
>
Note that the results of the commands x[2,3], x[3,] and x[,2] are all vectors. If a
matrix is desired instead, use the argument drop=FALSE, for example x[3,,drop=FALSE].
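The effect of the drop argument can be seen in the short sketch below. The definition of x is an assumption chosen to be consistent with the outputs shown above.

```r
# Matrix consistent with the outputs above (assumed definition)
x <- matrix(1:9, nrow = 3)

x[3, ]                  # a plain vector: 3 6 9
x[3, , drop = FALSE]    # a 1 x 3 matrix
x[, 2, drop = FALSE]    # a 3 x 1 matrix

is.matrix(x[3, ])                 # FALSE
is.matrix(x[3, , drop = FALSE])   # TRUE
```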
Negative indexing is allowed with matrices as well. As we saw earlier, a row or a column
of a matrix is a vector. So, negative indices for rows or columns of a matrix will drop the
corresponding rows and columns. For example, to drop the second row from the matrix x, use
the command x[-2,]; to drop the second column, use x[,-2]:
> x[-2,]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 3 6 9
> x[,-2]
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9
To drop the second row as well as the second column from the matrix x:
> x[-2,-2]
[,1] [,2]
[1,] 1 7
[2,] 3 9
Fetching a Submatrix
Having understood how to fetch a row, a column and a particular element of a matrix, let
us now consider how to get a submatrix of the given matrix. Suppose from the matrix x
defined above, we want to extract the submatrix
5 8
6 9
This submatrix consists of all the elements except those in the first row and the first column. So, the
command x[-1,-1] fetches the above submatrix.
> x[-1,-1]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>
This can also be achieved in several different ways. Let us consider the command x[2:3,].
This command results in the submatrix
2 5 8
3 6 9
In this submatrix, we do not require the first column. Therefore, the command x[2:3,-1]
results in the desired submatrix.
> x[2:3,-1]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>
In the above command you are omitting the column that is not required. Instead, you can
specify which columns are required. That is, the command x[2:3,2:3] outputs the desired
submatrix.
> x[2:3,2:3]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>
Think about other equivalent commands that fetch the specified submatrix. Suppose you
want to extract the submatrix
1 4
3 6
This submatrix may be extracted using the command x[c(1,3),c(1,2)]:
> x[c(1,3),c(1,2)]
[,1] [,2]
[1,] 1 4
[2,] 3 6
>
We may need to change one or more values of a matrix. First, let us consider modifying
the value of a single element in the matrix x. To modify the value of the element in the 2nd
row and 2nd column:
> x[2,2] <- 15
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 15 8
[3,] 3 6 9
>
Let us now consider modifying more than one value of a matrix. Suppose we want to change
the values of the first two elements in the 2nd row of x to 12 and 15. These elements are
x[2,1:2], which is a vector. Assign the vector c(12, 15) to x[2,1:2].
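The assignment just described can be sketched as below. The starting matrix is assumed from the earlier steps (1 to 9 filled column-wise, with x[2,2] already set to 15).

```r
# Starting point assumed from the earlier steps
x <- matrix(1:9, nrow = 3)
x[2, 2] <- 15

# Replace the first two elements of the 2nd row in one assignment
x[2, 1:2] <- c(12, 15)
x
```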
Now, consider modifying the values of the 1st and 3rd elements in the first column of the
matrix x to 11 and 13.
> x[c(1,3),1] <- c(11,13)
> x
[,1] [,2] [,3]
[1,] 11 4 7
[2,] 12 15 8
[3,] 13 6 9
Now, consider modifying the values in an entire column of a matrix. For example, let us change
the values of the first column of x back to 1,2,3.
> x[,1] <- c(1,2,3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 15 8
[3,] 3 6 9
Note that the submatrix elements are filled column-wise by default. Instead, if we want
to modify the elements row-wise use the byrow= option of the matrix() function. Suppose,
you want to modify the values of the above submatrix with (18, 14, 17, 19), such that
the first row elements of the new submatrix are (18, 14) and the second row (17,19).
> y <- c(18,14,17,19)
> x[1:2,2:3] <- matrix(y, nrow=2, byrow=TRUE)
> x
[,1] [,2] [,3]
[1,] 1 18 14
[2,] 2 17 19
[3,] 3 6 9
You learned that the members of a vector can be filtered by numeric indices,
negative indices, logical indices or names. Subsetting of matrices can also be achieved
using any of these procedures. We have just seen subsetting matrices using numeric and
negative indices. Let us now consider subsetting matrices using the names of the rows and
columns. To illustrate this, let us create three vectors x, y and z and then
use the rbind() function to bind them into a matrix.
> x <- c(1,2,3)
> y <- c(11,22,33)
> z <- c(12,23,34)
>
> ( row.lab <- rbind(x,y,z) )
[,1] [,2] [,3]
x 1 2 3
y 11 22 33
z 12 23 34
>
Now, single rows of the above matrix row.lab can be fetched as follows:
> row.lab["x",]
[1] 1 2 3
>
> row.lab["z",]
[1] 12 23 34
The elements of a matrix are stored in the memory one column after another in contiguous
memory locations. This means that, the members of a matrix can also be accessed using
single numeric indexing.
> matx
C1 C2 C3 C4
R1 1 2 3 4
R2 5 6 7 8
R3 9 10 11 12
>
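Single numeric indexing on such a matrix can be sketched as follows. The construction of matx is an assumption that reproduces the matrix printed above.

```r
# Assumed construction reproducing the matrix printed above
matx <- matrix(1:12, nrow = 3, byrow = TRUE,
               dimnames = list(c("R1", "R2", "R3"),
                               c("C1", "C2", "C3", "C4")))

# Storage is column by column: 1 5 9 2 6 10 3 7 11 4 8 12
as.vector(matx)

matx[5]          # 5th stored element, i.e. matx["R2", "C2"], which is 6
matx[c(1, 12)]   # first and last stored elements: 1 12
```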
Operations on Matrices
Element-Wise Operations
Let A and B be two matrices of the same dimension. The operators +, -, *, / and ^,
when used with matrices of the same dimension, perform the required operations on the
corresponding elements of the matrices and result in a new matrix of the same dimension.
These operations are usually referred to as element-wise or element-by-element operations.
Element-Wise Operations
Operator   A op B    Meaning
+          A + B     Addition of corresponding elements of A and B
-          A - B     Subtracts the elements of B from the corresponding elements of A
/          A / B     Divides the elements of A by the corresponding elements of B
*          A * B     Multiplies the elements of A by the corresponding elements of B
^(-1)      A^(-1)    Results in a matrix whose elements are reciprocals of the elements of A
Matrix multiplication in the linear-algebra sense is written A %*% B.
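The element-wise operators can be tried on two small matrices, as in the sketch below. The matrices A and B are illustrative assumptions.

```r
A <- matrix(1:4, nrow = 2)            # columns: (1, 2) and (3, 4)
B <- matrix(c(2, 2, 2, 2), nrow = 2)

A + B      # element-wise addition
A - B      # element-wise subtraction
A * B      # element-wise multiplication (not the matrix product)
A / B      # element-wise division
A^(-1)     # reciprocals of the elements of A (not the matrix inverse)

A %*% B    # matrix product in the linear-algebra sense
```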
is.matrix() function
To verify whether a given R object is a matrix object, use the is.matrix() function. Let us
create a data frame and invoke is.matrix() on that object.
as.matrix() function
Let us now convert the dataframe object into a matrix object using the as.matrix() function
and then again invoke the is.matrix() on the resulting object to verify whether the function
successfully converted the dataframe object into a matrix object.
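A minimal sketch of both checks follows. The data frame marks.df and its values are hypothetical.

```r
# A small data frame with illustrative values
marks.df <- data.frame(Maths = c(100, 98, 89), Phy = c(95, 82, 79))

is.matrix(marks.df)            # FALSE: a data frame is not a matrix

marks.mat <- as.matrix(marks.df)
is.matrix(marks.mat)           # TRUE after the conversion
marks.mat
```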
The rowSums() and colSums() functions can be used to compute the sums of the rows and
columns of a matrix object. We know that the row sums of a transition probability matrix (TPM)
must each be equal to 1. So, let us create a TPM and verify whether its rows sum to 1.
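One way to carry out this check is sketched below. The 3-state matrix P is a hypothetical TPM, not the one used in the original notes.

```r
# A hypothetical 3-state transition probability matrix, filled row-wise
P <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.4, 0.5), nrow = 3, byrow = TRUE)

rowSums(P)    # each row should sum to 1
colSums(P)    # column sums need not equal 1

# Compare with a tolerance rather than ==, because of floating-point rounding
all(abs(rowSums(P) - 1) < 1e-12)
```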
1
September 22, 2020
R - Matrices
To achieve matrix multiplication as in Linear Algebra, we have to use the operator
%*%. We know that if P is a TPM, then P^n is also a stochastic matrix for every positive
integer n.
It is easy to observe that the Markov chain corresponding to the given TPM is finite,
irreducible and aperiodic. We know that, for such a Markov chain, a stationary distribution
exists. Having P^3, we can compute P^6, P^12 and P^24 by repeated squaring. Print the contents of the P^24 matrix,
confirm that the stationary distribution has been obtained, and then verify that the matrix is a TPM.
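The computation can be sketched as follows, assuming a hypothetical TPM with all entries positive (which makes the chain irreducible and aperiodic). The names P3, P6, P12 and P24 follow the text.

```r
# Hypothetical TPM with all entries positive
P <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.4, 0.5), nrow = 3, byrow = TRUE)

P3  <- P %*% P %*% P     # P^3
P6  <- P3 %*% P3         # P^6
P12 <- P6 %*% P6         # P^12
P24 <- P12 %*% P12       # P^24

P24                      # the rows should be (nearly) identical
rowSums(P24)             # and each row should still sum to 1
```

When the rows of P24 agree to several decimal places, any one of them approximates the stationary distribution.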
#
# Illustrating the paste() and append() function
#
> x <- sample(10:20,12,replace=T)
> x
[1] 11 18 13 12 12 19 20 11 10 15 14 12
> x <- matrix(x,nrow=4)
> x
[,1] [,2] [,3]
[1,] 11 12 10
[2,] 18 19 15
[3,] 13 20 14
[4,] 12 11 12
>
> apply(x,1,sum)
[1] 33 52 47 35
>
> cbind(x,apply(x,1,sum))
[,1] [,2] [,3] [,4]
[1,] 11 12 10 33
[2,] 18 19 15 52
[3,] 13 20 14 47
[4,] 12 11 12 35
>
> paste("Stu",1:4)
[1] "Stu 1" "Stu 2" "Stu 3" "Stu 4"
> paste("Stu",1:4,sep="")
[1] "Stu1" "Stu2" "Stu3" "Stu4"
>
> marks <- x
> marks
[,1] [,2] [,3]
[1,] 11 12 10
[2,] 18 19 15
[3,] 13 20 14
[4,] 12 11 12
> rownames(marks) <- paste("Stu",1:4,sep="")
> marks
[,1] [,2] [,3]
Stu1 11 12 10
Stu2 18 19 15
Stu3 13 20 14
Stu4 12 11 12
> colnames(marks) <- paste("P",1:3,sep="")
> marks
P1 P2 P3
Stu1 11 12 10
Stu2 18 19 15
Stu3 13 20 14
Stu4 12 11 12
>
> Total <- apply(marks,1,sum)
> Total
Stu1 Stu2 Stu3 Stu4
33 52 47 35
>
> cbind(marks,Total)
P1 P2 P3 Total
Stu1 11 12 10 33
Stu2 18 19 15 52
Stu3 13 20 14 47
Stu4 12 11 12 35
>
> marks <- cbind(marks,Total)
> marks
P1 P2 P3 Total
Stu1 11 12 10 33
Stu2 18 19 15 52
Stu3 13 20 14 47
Stu4 12 11 12 35
>
Exercise-1:
(b) What are the standard deviations of males and females scores?
Solutions:
(a)
> f.score
[1] 77 66 88 67 59
>
> m.score
[1] 80 60 82 84 62
>
> cor(f.score,m.score)
[1] 0.6618506
>
(b)
> with(df,tapply(Score,Gender,sd))
Female Male
11.28273 11.61034
>
(c)
> df[df$Score==max(df$Score),"Student"]
[1] "Vanaja"
>
(d)
> df[df$Gender=="Male","Score"]
[1] 80 60 82 84 62
> max(df[df$Gender=="Male","Score"])
[1] 84
> df$Score==max(df[df$Gender=="Male","Score"])
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
> df[df$Score==max(df[df$Gender=="Male","Score"]),]
> df[df$Score==max(df[df$Gender=="Male","Score"]),"Student"]
[1] "Tharun"
>
R-Dataframes
Outline
- What is a data frame?
- How to create a data frame?
    - data.frame()
    - as.data.frame()
    - read.csv()
- How to subset a data frame?
- Data fetching
    - a single value
    - a row
    - a column
- Modifying data frame structure
    - changing a value
    - adding/deleting a row
    - adding/deleting a column
- How to write a data frame to a file
A data frame is a two-dimensional data structure in which all columns are of the same length
and, within each column, the data values are of the same atomic type; different columns may be of different types.
The basic function used to create a data frame is data.frame(). In practice, data frames are usually
created by reading a .CSV file or an Excel file, or by importing data from statistical packages such as SPSS,
SAS and so on.
If the data is in the form of several vectors of the same length, use the data.frame()
function. Consider the following examples:
> data.frame(1:5,11:15)
X1.5 X11.15
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
So, when we create a data frame with unnamed vectors, R assigns default
names, which turn out to be very clumsy.
We can give names to the vectors inside the data.frame() function while creating the data
frame; these become the column names of the data frame, as shown below:
1
September 23, 2020
R-Dataframes
> data.frame(a=c(12,32,15,15,24),b=c(22,23,21,24,23))
a b
1 12 22
2 32 23
3 15 21
4 15 24
5 24 23
>
You can also create a dataframe by first creating the data vectors and then using them within
the data.frame() function.
> a <-c(12,32,15,15,24)
> b <- c(22,23,21,24,23)
> df <- data.frame(a,b)
> df
a b
1 12 22
2 32 23
3 15 21
4 15 24
5 24 23
>
In practice, we will create a data frame by reading data from a disk file. If the data is in a
file, use the read.csv() function to create a data frame.
Suppose we have a text file in our working directory with the name data1.txt.
Data File
Student Gender Residence Score
Sarayu Female Resident 77
Rayudu Male Resident 80
Gowtam Male Nonresident 60
Vasant Male Resident 82
Vinuta Female Nonresident 66
Vanaja Female Resident 88
Tharun Male Resident 84
Pavani female resident 77
Venkat male nonresident 62
Janaki female nonresident 59
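Reading such a file can be sketched as below. This is an assumption about how the file might be read, not the command used in the original notes: the listing is blank-separated, so read.table() with header = TRUE is used here, and only a few of the rows are recreated in a temporary file. Text columns are read as factors by default in R versions before 4.0.0 and as character vectors from 4.0.0 onwards; stringsAsFactors makes the choice explicit.

```r
# Recreate part of the data file in a temporary location (abridged rows)
lines <- c("Student Gender Residence Score",
           "Sarayu Female Resident 77",
           "Rayudu Male Resident 80",
           "Gowtam Male Nonresident 60",
           "Pavani female resident 77")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Fields are separated by blanks, so read.table() with header = TRUE is used;
# read.csv() is the counterpart for comma-separated files.
# stringsAsFactors = FALSE keeps text columns as character vectors.
df <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
str(df)
```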
> df
Student Gender Residence Score
1 Sarayu Female Resident 77
2 Rayudu Male Resident 80
3 Gowtam Male Nonresident 60
4 Vasant Male Resident 82
5 Vinuta Female Nonresident 66
6 Vanaja Female Resident 88
7 Tharun Male Resident 84
8 Pavani female resident 77
9 Venkat male nonresident 62
10 Janaki female nonresident 59
> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : Factor w/ 10 levels "Gowtam","Janaki",..: 5 4 1 8 10 7 6 3 9 2
$ Gender : Factor w/ 4 levels "female","Female",..: 2 4 4 4 2 2 4 1 3 1
$ Residence: Factor w/ 4 levels "nonresident",..: 4 4 2 4 2 4 4 3 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>
> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ Residence: chr "Resident" "Resident" "Nonresident" "Resident" ...
$ Score : int 77 80 60 82 66 88 84 77 62 59
>
Now we observe that the variables Student, Gender, and Residence are character variables.
Note that there are some case differences among the values of the variables Gender and
Residence. You need to convert them to a uniform case before you attempt any analysis
using them. Consider the Residence variable:
> df[8:10,3]
[1] "resident" "nonresident" "nonresident"
> c("Resident",rep("Nonresident",2))
[1] "Resident" "Nonresident" "Nonresident"
>
> df[8:10,3] <- c("Resident",rep("Nonresident",2))
> df[,3]
[1] "Resident" "Resident" "Nonresident" "Resident"
[5] "Nonresident" "Resident" "Resident" "Resident"
[9] "Nonresident" "Nonresident"
>
> df[,3] <- as.factor(df$Residence)
> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ Residence: Factor w/ 2 levels "Nonresident",..: 2 2 1 2 1 2 2 2 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>
Query your data frame to find the number of levels of the factor variable and the number
of observations at each level.
> levels(df$Residence)
[1] "Nonresident" "Resident"
> table(df$Residence)
Nonresident Resident
4 6
>
Note that the variable Residence is now a factor variable with two levels.
Exercise
Try to correct the data errors in the Gender variable and then change it into a factor
variable.
> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 2 1
$ Residence: Factor w/ 2 levels "Nonresident",..: 2 2 1 2 1 2 2 2 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>
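One possible approach to this exercise is sketched below on a standalone vector holding the Gender values from the data file; with the data frame, the same steps would be applied to df$Gender. This is only one of several ways to make the case uniform.

```r
# Gender values as they appear in the data file
Gender <- c("Female", "Male", "Male", "Male", "Female",
            "Female", "Male", "female", "male", "female")

# Capitalise the first letter and lower-case the rest, then convert to a factor
Gender <- paste(toupper(substring(Gender, 1, 1)),
                tolower(substring(Gender, 2)), sep = "")
Gender <- as.factor(Gender)

levels(Gender)    # "Female" "Male"
table(Gender)
```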
You can use as.data.frame() function to create a data frame out of a matrix.
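A minimal sketch of this conversion follows. The matrix m and its values are illustrative assumptions.

```r
# A small numeric matrix with illustrative values and column names
m <- matrix(c(30, 26, 23, 25, 24, 23), nrow = 3,
            dimnames = list(NULL, c("Math", "Phy")))

df.m <- as.data.frame(m)
class(df.m)       # "data.frame"
names(df.m)       # "Math" "Phy"
df.m
```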
By default, a data frame has its observations (rows) named "1", "2", "3", and so on. You can check
this using the rownames() function.
> rownames(df)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
>
The names() or colnames() functions can be used to fetch the names of the variables (columns) of a data
frame.
> names(df)
[1] "Student" "Gender" "Residence" "Score"
>
> colnames(df)
[1] "Student" "Gender" "Residence" "Score"
>
The names() function can also be used to assign names to columns of a data frame.
> names(df)[2]
[1] "Gender"
> names(df)[2] <- "Sex"
> names(df)
[1] "Student" "Sex" "Residence" "Score"
>
Now change the column name from Sex to Gender using the colnames() function.
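The renaming can be sketched as below, using a cut-down stand-in for df whose second column is currently named Sex.

```r
# A cut-down stand-in for df, with the second column named "Sex"
df <- data.frame(Student = c("Sarayu", "Rayudu"),
                 Sex = c("Female", "Male"),
                 Score = c(77, 80))

colnames(df)[2] <- "Gender"
colnames(df)      # "Student" "Gender" "Score"
```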
Unlike matrices, you cannot delete the row names of a data frame.
Suppose we want to know the residence status of the student Vasant. The name Vasant appears in the 4th row and Residence is the 3rd variable in our
data frame. So, to fetch the required information, issue the command:
> df[4,3]
[1] Resident
Levels: Nonresident Resident
>
While dealing with large data frames, it is easy to remember the column names rather than
their numbers. The R System supports fetching the values in data frame using the names of
the columns.
A column name or a variable can be fetched using a $ symbol with the data frame name.
For example, the Student variable of the data frame df can be accessed as df$Student.
> df$Student
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
To know whether there is a student named Vasant, compare the string "Vasant" to df$Student.
> df$Student=="Vasant"
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
>
Note that the 4th element of the result of the above command is TRUE and every other element is
FALSE. So, when we pass this vector as the row index of the data frame df and supply
"Residence" as the column index, we get:
> df[df$Student=="Vasant","Residence"]
[1] Resident
Levels: Nonresident Resident
>
Fetching a Row
To fetch an observation (or case, record, or row), we leave the second dimension empty, as
in the case of matrices. For example, the 8th row of the data frame can be accessed using any of the
following commands:
> df[8,]
Student Gender Residence Score
8 Pavani Female Resident 77
>
> df[df$Student=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 77
>
> df[df=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 77
>
Suppose, Pavani’s score was wrongly noted as 77 instead of 67. You modify Pavani’s
score using the command
> ## fetching Pavani’s Score
>
> df[df$Student=="Pavani","Score"]
[1] 77
>
> # Modifying Pavani's Score
>
> df[df$Student=="Pavani","Score"] <- 67
>
> # View modified record
>
> df[df$Student=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 67
>
How many Nonresident female students are there? Who are they?
> df$Gender=="Female"&df$Residence=="Nonresident"
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
> sum(df$Gender=="Female"&df$Residence=="Nonresident")
[1] 2
>
> df[df$Gender=="Female"&df$Residence=="Nonresident",]
Student Gender Residence Score
5 Vinuta Female Nonresident 66
10 Janaki Female Nonresident 59
>
> with(df,tapply(Score,Gender,mean))
Female Male
71.4 73.6
>
> with(df,tapply(Score,Residence,mean))
Nonresident Resident
61.75000 79.66667
>
> with(df,tapply(Score,list(Gender,Residence),mean))
Nonresident Resident
Female 62.5 77.33333
Male 61.0 82.00000
>
> table(df$Residence)
Nonresident Resident
4 6
> table(df$Gender)
Female Male
5 5
>
Exercise-2
(a) What is the correlation between scores of males and females?
(b) What are the standard deviations of males and females scores?
(c) Find the name of the student having maximum score?
(d) Among males, who scored max marks?
Fetching a column
A column in a data frame is a variable. Unlike matrices and arrays, data frames are not
internally stored as a single vector. They are stored as a list of vectors, one per column.
We can use numeric indices, names and logical vectors for selection of variables as with
matrices. We can also select a variable by inserting a $ symbol in between the data frame
name and column name, in that order.
> df[,1]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[,"Student"]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[,c(TRUE,rep(FALSE,3))]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df$Student
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
Note that, the output is a vector in all the above cases. If we want to fetch a column as a
data frame, then use the drop=FALSE option.
> df[,1,drop=FALSE]
Student
1 Sarayu
2 Rayudu
3 Gowtam
4 Vasant
5 Vinuta
6 Vanaja
7 Tharun
8 Pavani
9 Venkat
10 Janaki
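You can also use square brackets with a single index to get a column of a data frame, since
a data frame is stored in memory as a list of columns. Single brackets return a one-column data frame, whereas double brackets return the column itself as a vector.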
> df["Student"]
Student
1 Sarayu
2 Rayudu
3 Gowtam
4 Vasant
5 Vinuta
6 Vanaja
7 Tharun
8 Pavani
9 Venkat
10 Janaki
>
>
> df[["Student"]]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[["Student"]][1:5]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta"
>
Methods of Subsetting
> "["(df,c("Student","Score"))
Student Score
1 Sarayu 77
2 Rayudu 80
3 Gowtam 60
4 Vasant 82
5 Vinuta 66
6 Vanaja 88
7 Tharun 84
8 Pavani 67
9 Venkat 62
10 Janaki 59
>
"[" is a function whose first argument is the data frame and whose second argument is a
column index.
Modifying Dataframes
Adding One Observation
> df1
Math Phy Chem
1 30 25 20
2 26 24 22
3 23 23 19
4 21 21 23
5 24 24 22
6 25 25 23
>
(1) If the data frame contains only numeric values and the rows have default names:
Alternatively, you can also add a new row to the data frame using the rbind() function as
follows (assuming the original data frame):
(2) If the data frame contains only numeric values and the rows are labelled:
(3) Suppose different columns of the data frame contain different atomic types:
Create a new data frame with column names being the same as those of the old data frame and
also in the same order. Now use the rbind() function to combine both these data frames into a
single data frame. To be able to bind the new data frame with the old data frame, you have
to make sure that the column names match exactly in both the data frames, including the
case.
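The cases above can be sketched as follows. The values in df1 are taken from the first rows of the data frame shown earlier; df2, new.row and the added values are hypothetical. For case (2), the same rbind() call applies.

```r
# Case (1): numeric columns, default row names
df1 <- data.frame(Math = c(30, 26, 23), Phy = c(25, 24, 23),
                  Chem = c(20, 22, 19))
df1[nrow(df1) + 1, ] <- c(21, 21, 23)    # assign to a new row index
df1 <- rbind(df1, c(24, 24, 22))         # or bind a numeric vector

# Case (3): columns of different types; build a one-row data frame whose
# column names match exactly, then bind the two data frames
df2 <- data.frame(Student = c("Sarayu", "Rayudu"),
                  Score = c(77, 80),
                  stringsAsFactors = FALSE)
new.row <- data.frame(Student = "Gowtam", Score = 60,
                      stringsAsFactors = FALSE)
df2 <- rbind(df2, new.row)
df2
```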
The read.spss function in the foreign package reads all versions of SPSS files, both .sav and .por types.
library(foreign)
There is no canned function to write out a completed SPSS dataset, but there are two auxiliary functions
in the foreign package that allow users to write out a text data file and then an input syntax file that will
read the data in and make the "right" variable and value labels.
... writeForeignSPSS() takes three arguments, first is the R data frame you
want to write out, the second is the name of a data file to which the data will
be written and the third is the name of a code file to which the code to input
Stata
... The read.dta function in the foreign package reads in Stata datasets saved in formats earlier than
Stata 13.
library(foreign)
... To read Stata files from version 13 or later, you can use the read.dta13 function in the readstata13
package. First, you have to install the package:
install.packages('readstata13')
library(readstata13)
dat <- read.dta13('xyz.dta', nonint.factors=TRUE)
write.dta() writes a Stata .dta file of the dataset. The benefit here is that factors remain defined as
variables with labels in Stata. Those attributes go away in the text files.
-------------------------------------
tidyverse - haven
... Haven enables R to read and write various data formats used by other statistical packages.
Usage
The output objects:
- Are tibbles, which have a better print method for very long and very wide files.
- Translate value labels into a new labelled() class, which preserves the original semantics and can easily
be coerced to factors with as_factor(). Special missing values are preserved.
- Dates and times are converted to R date/time classes. Character vectors are not converted to factors.
Read SPSS (.sav, .zsav, .por) files. Write .sav and .zsav files.
------------------------------------------------
install.packages("haven")
library(haven)
write_sav(object, "filename.sav")
For example, if you wanted to download the package that would allow you to install
install.packages('sas7bdat')
R can read data from a wide variety of sources and in a wide variety of formats.
The foreign package contains methods to read SAS permanent datasets (SAS7BDAT files) using
read.ssd, Stata DTA files with read.dta, and SPSS data files with read.spss. Each of these files can be
written with write.foreign.
read any Excel file on any system. It provides a choice of functions for reading Excel files: spreadsheets
can be imported with read.xlsx and read.xlsx2, which do more processing in R and in Java, respectively.
Many statistical packages (SAS, SPSS) can save data as an EXCEL file.
Import any type of data into R by using EXCEL and saving there
Once the comma delimited file is created using the "Save As" feature
in EXCEL you can import it into R using either the read.table() or the
read.csv() function. Before importing, determine which separator was used in the ".csv" file (comma or
semi-colon). Then:
Option 1: The separator is a comma (,) and NO headers
Select the table from the excel file, copy, go to the R Console and type:
-----------------------------------------------
install.packages("openxlsx")
library(openxlsx)
Dr. L. V. Rao
Arguments
x the data to be written out, usually an atomic vector.
file a connection, or a character string naming the file to write
to. If "", print to the standard output connection.
ncolumns the number of columns to write the data in.
append if TRUE the data x are appended to the connection.
sep a string used to separate columns. Using sep = "\t" gives
tab delimited output; default is " ".
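The arguments listed above can be exercised with a short sketch. The data and the temporary file are illustrative assumptions.

```r
x <- 1:12
path <- tempfile(fileext = ".txt")

# Write the vector in 4 columns, tab-separated
write(x, file = path, ncolumns = 4, sep = "\t")

# Append a second batch to the same file
write(13:16, file = path, ncolumns = 4, append = TRUE, sep = "\t")

readLines(path)
```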
Importing Data Into R - Dr. L. V. Rao,, August 25, 2019 3/1
write() function
> write.table(mat)
"Maths" "Phy" "Chem"
"1" 100 95 92
"2" 98 82 84
"3" 89 79 81
"4" 95 88 80
>
> write.table(mat, quote=FALSE)
Maths Phy Chem
1 100 95 92
2 98 82 84
3 89 79 81
4 95 88 80
>
> write.table(mat, file="temp.txt")
> write.table(mat, file="temp1.txt", quote=FALSE)
"dates","prices"
"3/27/1995",11.1
"4/3/1995",7.9
"4/10/1995",1.9
"4/18/1995",7.3
SPSS
> library(foreign)
> xyz <- read.spss("xyz.sav", to.data.frame = T)
MINITAB
> library(foreign)
> xyz <- as.data.frame(read.mtp("xyz.mtp"))
SAS
> library(foreign)
> xyz <- read.xport("xyz")
SYSTAT
> library(foreign)
> xyz <- read.systat("xyz.syd", to.data.frame = T)
For locales which use a character other than the period (.) as
a decimal point, the dec= argument can be used to specify an
alternative.
You can control which lines are read from your input source
using the skip= argument that specifies a number of lines to
skip at the beginning of your file, and the nrows= argument
which specifies the maximum number of rows to read.
For very large inputs, specifying a value for nrows= which is
close to but greater than the number of rows to be read may
provide an increase in speed.
"Rama Rao";12;15;12;13
"Subba Rao";14;15;15;15
"Usha Rani";13;12;15;14
"Yohan Babu";11;11;12;11
"Thilak";12;14;15;11
> read.table("noheadtab-com.txt")
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
> read.table("noheadtab-com.txt",comment.char="#")
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
> read.csv2("headcolmiss.txt")
Student Prob Dist Estn Prog
1 Rama Rao 12 NA 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 NA
>
Reading Data into R - Dr. L. V. Rao,, August 20, 2019 26 / 29
read.delim() function
read.delim( file,
header = TRUE,
sep = "\t",
quote = "\"",
dec = ".",
fill = TRUE,
comment.char = "", ...)
read.delim2( file,
header = TRUE,
sep = "\t",
quote = "\"",
dec = ",",
fill = TRUE,
comment.char = "", ...)
Function        sep    dec
read.delim()    \t     "."
read.delim2()   \t     ","
12 13 14 15
14 12 12 12
15 15 13 12
14 12 12 12
15 14 14 12
> read.fwf("fwf-tab.txt",widths=c(2,-1,2,-1,2,-1,2))
V1 V2 V3 V4
1 12 13 14 15
2 14 12 12 12
3 15 15 13 12
4 14 12 12 12
5 15 14 14 12
12131415
14121212
15151312
14121212
15141412
> read.fwf("fwf.txt",widths=c(2,2,2,2))
V1 V2 V3 V4
1 12 13 14 15
2 14 12 12 12
3 15 15 13 12
4 14 12 12 12
5 15 14 14 12
>
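Both read.fwf() calls above can be reproduced in a self-contained way, as sketched below. The two files are recreated (abridged to two rows) in temporary locations; the names spaced, packed, a and b are hypothetical.

```r
# Recreate the two fixed-width files shown above in temporary locations
spaced <- tempfile(fileext = ".txt")
packed <- tempfile(fileext = ".txt")
writeLines(c("12 13 14 15", "14 12 12 12"), spaced)
writeLines(c("12131415", "14121212"), packed)

# Negative widths skip that many characters (here, the single blank)
a <- read.fwf(spaced, widths = c(2, -1, 2, -1, 2, -1, 2))

# With no separators, the widths alone determine the fields
b <- read.fwf(packed, widths = c(2, 2, 2, 2))

a
b
```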
Since the county names contain blanks and are not surrounded
by quotes, read.table() will have difficulty reading the data.
However, since the names are always in the same columns, we
can use read.fwf() function.
The commas in the population values will force read.fwf() to
treat them as character values, and, like read.table(), it will
convert them to factors, which may prove inconvenient later.
If we wanted to extract the state values from the county
names, we might want to suppress factor conversion for these
values as well, and as.is=TRUE will be used.
Assuming that the data is stored in a file named city.txt, the
values could be read as follows:
A CSV file is just a normal text file that commonly begins with a
header line listing the names of the variables, each separated by a
comma. The remainder of the file after the header row is expected
to consist of rows of data that record the observations. For each
observation, the fields are separated by commas, delimiting the
actual observation of each of the variables.
Data is often supplied in comma-separated-values (.csv) format,
which is a text file that separates data with special text characters
called delimiters. Files in .csv format can be opened in most
spreadsheet applications. Spreadsheet data should be saved in .csv
format before importing into R.
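A round trip through write.csv() and read.csv() can be sketched as follows, using the dates and prices data that appear in these notes. The temporary file and the object names prices and back are assumptions.

```r
prices <- data.frame(dates = c("3/27/1995", "4/3/1995",
                               "4/10/1995", "4/18/1995"),
                     prices = c(11.1, 7.9, 1.9, 7.3))

path <- tempfile(fileext = ".csv")

# row.names = FALSE omits the row-number column from the file
write.csv(prices, file = path, row.names = FALSE)
readLines(path)                 # header line, then one line per observation

back <- read.csv(path, stringsAsFactors = FALSE)
str(back)
```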
"dates","prices"
"3/27/1995",11.1
"4/3/1995",7.9
"4/10/1995",1.9
"4/18/1995",7.3
if(condition)
{
# do something
}
The block of code associated with if gets executed only if the
condition evaluates to TRUE.
The parentheses around the condition are mandatory.
The curly braces are optional when the body of if has ONLY one
statement to be executed.
The if statement, with or without else, tests a single logical
statement; it is not an element-wise (vector) function.
n <- 6
m <- 3
if(n %% m == 0)
{
print(paste(n,"is divisible by",m ))
}
if(cond)
{
# do something
} else {
# do something else
}
n <- 7
m <- 3
if(n %% m == 0)
{
print(paste(n,"is divisible by",m ))
} else {
print(paste(n,"is NOT divisible by",m ))
}
For some recoding tasks, the ifelse function may be more useful
than manipulating logical variables directly. Suppose we have a
variable called group that takes on values in the range of 1 to 5,
and we wish to create a new variable that will be equal to 1 if the
original variable is either 1 or 5, and equal to 2 otherwise.
# Illustrating ifelse
# giving discount
bill <- c(12500, 10131, 567, 8999)
bill.amt <- ifelse(bill > 10000, bill * 0.8, bill)
bill.amt
[1] 10000.0 8104.8 567.0 8999.0
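The recoding task described earlier (a variable group taking values from 1 to 5) can be sketched in the same way. The values of group and the name newgroup are illustrative assumptions.

```r
group <- c(1, 3, 5, 2, 4, 1)       # illustrative values in the range 1 to 5

# 1 when group is 1 or 5, and 2 otherwise; evaluated element by element
newgroup <- ifelse(group %in% c(1, 5), 1, 2)
newgroup                           # 1 2 1 2 2 1
```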
if(cond1){
expr1
} else if(cond2){
expr2
} else {
expr3
}
#computing factorial
fact <- 1
for(k in 1:5)
{
fact <- fact * k
print(fact)
}
while(cond) expr
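As a small illustration of the while syntax, the factorial computed earlier with a for loop can be rewritten as below. The loop runs as long as the condition k <= 5 holds.

```r
# Computing 5! with a while loop
fact <- 1
k <- 1
while (k <= 5) {
  fact <- fact * k
  k <- k + 1
}
fact      # 120
```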
> x[x %% 2 == 0]
> x
[1] 86 100 32 68 36 52 96 96 8 56
[11] 10 82 64 58 38 66 46 86 18 78
[21] 58 82 52 40 10
repeat{
statement
}
x <- 1
repeat
{
print(x)
x <- x+1
if (x == 6)
{
break
}
}
if (condition) {
break
}
x <- 1:5
for (val in x)
{
if(val == 3)
{
break
}
print(val)
}
[1] 1
[1] 2
if( condition )
{
next
}
x <- 1:5
for( val in x )
{
if (val == 3)
{
next
}
print(val)
}
[1] 1
[1] 2
[1] 4
[1] 5
na.last = NA
> mat1[sort(row.names(mat1)),]
Maths Phy Chem
Samata 95 88 80
Sarala 95 82 84
Sarayu 100 95 92
Saroja 89 79 NA
>
Sort the matrix by column names:
> mat1[,sort(colnames(mat1))]
Chem Maths Phy
Sarayu 92 100 95
Sarala 84 95 82
Saroja NA 89 79
Samata 80 95 88
>
> x.ord
[1] 12 NA 8 12 15 2 5
(positions: 1 2 3 4 5 6 7)
> order(x.ord)
[1] 6 7 3 1 4 5 2
(the last entry, 2, is the index of the NA value)
> mat1
Maths Phy Chem
Sarayu 100 95 92
Sarala 95 82 84
Saroja 89 79 NA
Samata 95 88 80
> order(mat1[,1])
[1] 3 2 4 1
> mat1[order(mat1[,1]),]
Maths Phy Chem
Saroja 89 79 NA
Sarala 95 82 84
Samata 95 88 80
Sarayu 100 95 92
>
Formula Notation - Dr. L. V. Rao,, August 28, 2019 14 / 19
order() function - matrix objects
> df
  Student Maths Phy Chem
1  Sarayu   100  95   92
2  Sarala    95  82   84
3  Saroja    89  79   NA
4  Samata    95  88   80
> order(df[,2])
[1] 3 2 4 1
> df[order(df[,2]),2]
[1]  89  95  95 100
> df[order(df[,2]),]
  Student Maths Phy Chem
3  Saroja    89  79   NA
2  Sarala    95  82   84
4  Samata    95  88   80
1  Sarayu   100  95   92
> df[order(df[,2],df[,4]),]
  Student Maths Phy Chem
3  Saroja    89  79   NA
4  Samata    95  88   80
2  Sarala    95  82   84
1  Sarayu   100  95   92
> x
[1] 12 NA  8 12 15  2  5
Ranks of the observations before ties are handled: 4 7 3 5 6 1 2
The two observations equal to 12 are tied for ranks 4 and 5, so each
receives a rank of (4+5)/2 = 4.5.
> rank(x)
[1] 4.5 7.0 3.0 4.5 6.0 1.0 2.0
L. V. Rao
February 4, 2021
Example
Suppose we have heights of 1000 individuals (500 males and 500 females) in
the form of a data frame (one column for heights and another for gender), and
we want to know the average heights of males and females. We can then group
heights by gender and calculate the average height for each level of gender.
Height Gender
175    Male
155    Female
180    Male
169    Male
170    Female
..     ..
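The grouping described in this example can be sketched with tapply(). Only the five rows visible in the table are used here, not the full set of 1000 individuals, and the data frame name heights is an assumption.

```r
# The five rows shown in the table (not the full 1000 observations)
heights <- data.frame(
  Height = c(175, 155, 180, 169, 170),
  Gender = factor(c("Male", "Female", "Male", "Male", "Female"))
)

# Split Height by the levels of Gender and average each group
avg <- with(heights, tapply(Height, Gender, mean))
avg
```

The result has one entry per level of Gender, in the order of the factor levels (Female, then Male).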
In the call tapply(X, INDEX, FUN),
X is the variable that we want to have the function applied to;
usually, it is a response variable.
INDEX describes how we want the X variable to be split up
FUN is the function to be applied
> with(iris,tapply(Sepal.Length,Species,mean))
setosa versicolor virginica
5.006 5.936 6.588
>
apply() Functions
(Syntax and Examples)
L. V. Rao
November 4, 2020
1 Introduction
split-apply-combine paradigm
apply() function
When to use?
The apply() function is used when the same function must be applied
to every row or every column of a matrix or a data frame.
Technically, apply() is for matrices, so it will attempt to
coerce a data frame into a matrix.
More generally, apply() works on anything that has dimensions:
arrays, matrices, and data frames.
X is a matrix or a dataframe
MARGIN is 1 or 2, according to whether we will operate on
rows or columns,
FUN is the function to be applied, and
fargs is an optional list of arguments to be supplied to
FUN.
Example
>
> (x <- matrix( 1:9, ncol=3 ) )
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
>
> apply( x, 1, sum )
[1] 12 15 18
>
Example
> list(1:3,25:30)
[[1]]
[1] 1 2 3
[[2]]
[1] 25 26 27 28 29 30
> lapply(list(1:3,25:30),median)
[[1]]
[1] 2
[[2]]
[1] 27.5
>
Data Visualization - Dr. L. V. Rao,, November 4, 2020 16 / 32
Introduction
Applying the Same Function to All Rows or Columns of a Matrix
lapply() function
Applying the Same Function to All Elements of a List
Functions Are First-Class Objects
> as.numeric(lapply(list(1:3,25:27),median))
[1] 2 26
> birds
Sparrow Pigeon Dove
[1,] 14 19 10
[2,] 5 5 13
[3,] 12 6 19
>
> colSums(birds)
Sparrow Pigeon Dove
31 30 42
>
> rowSums(birds)
[1] 43 23 37
> apply(birds, 2, sum)
Sparrow Pigeon Dove
31 30 42
>
>
> apply(birds, 1, sum)
[1] 43 23 37
>
>
sapply() function
The command sapply(df, class) will return the names and classes
(e.g., numeric, integer, or character) of each variable within a
dataframe.
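As a small illustration of the command above, the built-in iris data frame can stand in for df. The choice of iris is ours; any data frame would do.

```r
# Class of every column of the built-in iris data frame
cls <- sapply(iris, class)
cls
```

The result is a named character vector: the names are the column names and the values are their classes.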
tapply() function
> with(ToothGrowth, tapply(len, supp, mean) )
OJ VC
20.66333 16.96333
>
For tapply, as with split, the grouping variable is a factor or list of
factors. In the latter case, all combinations are computed before
splitting:
> with(ToothGrowth,
tapply(len, list(supp, dose), mean) )
0.5 1 2
OJ 13.23 22.70 26.06
VC 7.98 16.77 26.14
>
> x <-c(1:3,NA,4,5)
> sum(x)
[1] NA
> sum(x,na.rm=T)
[1] 15
> apply(matrix(x,nrow=1),1,sum, na.rm=T)
[1] 15
> apply(matrix(x,nrow=1),1,sum)
[1] NA
stem(x, scale = 1)
> stem(x)
0 | 112577
1 | 2234456677799
2 | 056789
>
0 | 112
0 | 577
1 | 22344
1 | 56677799
2 | 0
2 | 56789
>
The aplpack package has more options for stem-and-leaf plots
> install.packages("aplpack")
> library(aplpack)
>
> stem.leaf(x)
1 | 2: represents 12
leaf unit: 1
n: 25
3 0* | 112
6 0. | 577
11 1* | 22344
(8) 1. | 56677799
6 2* | 0
5 2. | 56789
>
Two sets of data values can be compared using a back-to-back stem-and-leaf
plot: the basic stem-and-leaf chart is modified to display two data sets, one
on each side of a common column of stems.
The chart contains much more information than the basic chart produced by the
stem() function. The left- and rightmost columns record the position of the
data values.
------------------------------------
x y
------------------------------------
3 211| 0* |112334 6
6 775| 0. |6 7
11 44322| 1* |2 8
6 0| 2* |00024 10
5 98765| 2. |56689 5
| 3* |
------------------------------------
n: 25 25
------------------------------------
barplot() Function in R
Barplot
The purpose of the barplot is to display the frequencies (or proportions) of levels of a factor
variable. For example, a barplot is used to pictorially display the frequencies (or proportions)
of individuals in various socio-economic(factor) groups(levels-high, middle, low). Such a plot
will help to provide a visual comparison among the various factor levels.
In barplot, factor-levels are placed on the x-axis and frequencies (or proportions) of various
factor-levels are considered on the y-axis. For each factor-level one bar of uniform width with
heights being proportional to factor level frequency (or proportion) is constructed.
The barplot() function is in the graphics package of R's system library. The barplot()
function must be supplied at least one argument. The R help calls this height, which must
be either a vector or a matrix. If it is a vector, its values give the heights of the bars, one bar per factor level.
To illustrate barplot(), consider the following data preparation:
Notice that the barplot() function places the factor levels on the x-axis in the lexicographical
order of the levels. The bars can be placed in the order stated in the vector grades by
reordering the frequencies accordingly; the parameter names.arg supplies the labels printed below the bars.
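The original data preparation is not reproduced in these notes, so the following is only a sketch under assumptions: obs holds hypothetical grades for 12 students, and grades lists the levels in the order in which we would like the bars drawn.

```r
# Hypothetical raw data: grades obtained by 12 students (illustration only)
obs <- c("A+", "B", "A-", "B+", "A+", "C", "B", "A-", "B+", "B", "A+", "B+")

# Frequency of each grade; table() sorts the levels
counts <- table(obs)
barplot(counts, main = "Grade frequencies", ylab = "Frequency")

# Desired order of the bars (assumed)
grades <- c("A+", "A-", "B+", "B", "C")

# Reorder the frequencies and label the bars to match
barplot(counts[grades], names.arg = grades, ylab = "Frequency")
```

Indexing the table by name reorders the bars; names.arg only sets the labels drawn under them.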
A bar plot with proportions on the y-axis can be obtained as follows:
The sizes of the factor-level names on the x-axis can be increased using the cex.names
parameter.
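A possible way to obtain proportions and larger level names is sketched below. The vector obs is hypothetical data, and the value 1.5 for cex.names is an arbitrary choice.

```r
# Hypothetical raw data: grades obtained by 12 students (illustration only)
obs <- c("A+", "B", "A-", "B+", "A+", "C", "B", "A-", "B+", "B", "A+", "B+")

# Convert the frequencies to proportions that sum to 1
props <- prop.table(table(obs))

# Proportions on the y-axis, with level names drawn 1.5 times the default size
barplot(props, ylab = "Proportion", cex.names = 1.5)
```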
The height parameter of barplot() can also be a matrix. For example, it could be a matrix
whose columns are the various subjects taken in a course and whose rows are the
grades. Consider the following matrix:
> gradTab
Algorithms Operating Systems Discrete Math
A- 13 10 7
A+ 10 7 2
B 4 2 14
B+ 8 19 12
C 5 2 5
To draw a stacked bar, simply use the command:
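The command itself is missing from these notes. One plausible version is sketched here, rebuilding gradTab from the values printed above and assuming the three columns are Algorithms, Operating Systems and Discrete Math.

```r
# Rebuild the matrix shown above (column names assumed from the printed header)
gradTab <- matrix(
  c(13, 10,  7,
    10,  7,  2,
     4,  2, 14,
     8, 19, 12,
     5,  2,  5),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("A-", "A+", "B", "B+", "C"),
                  c("Algorithms", "Operating Systems", "Discrete Math"))
)

# One stacked bar per column; each segment is one row (grade)
barplot(gradTab, legend.text = TRUE, ylab = "Frequency")

# Grouped (side-by-side) bars instead of stacked ones
barplot(gradTab, beside = TRUE, legend.text = TRUE, ylab = "Frequency")
```

With a matrix, barplot() stacks by default; beside = TRUE switches to grouped bars.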
Boxplot
Data Visualization in R
Dr. L. V. Rao
September 9, 2019
## Default S3 method:
boxplot(x, ..., range = 1.5, width = NULL,
varwidth = FALSE, notch = FALSE,
outline = TRUE, names, plot = TRUE,
border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5,
outwex = 0.5), horizontal = FALSE,
add = FALSE, at = NULL)
with(mtcars,
plot(mpg ~ factor(cyl), col = c(2,3,4),
xlab = "No. of Cylinders",
main = "Mileage against No. of Cylinders",
notch = TRUE ))
Warning message:
In bxp(list(stats = c(21.4, 22.8, 26, 30.4, 33.9, 17.8,
18.65, 19.7, :
some notches went outside hinges (’box’):
maybe set notch=FALSE
>
[Figure: notched box plots of mpg for 4, 6 and 8 cylinders; x-axis: No. of Cylinders, y-axis: mpg]
Dr. L. V. Rao
December 7, 2020
and thus find a 95% confidence interval for µ going from 72.48 to
93.52.
Here we have assumed that σ is known.
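The inputs behind this interval are not shown in these notes. As a sketch, assume a sample mean of 83, a known σ of 12 and n = 5; these values are our assumption, chosen because they reproduce the quoted limits.

```r
# Assumed inputs (not stated in the notes)
xbar  <- 83   # sample mean
sigma <- 12   # known population standard deviation
n     <- 5    # sample size

# Standard error of the mean
sem <- sigma / sqrt(n)

# 95% confidence interval: xbar -/+ N_0.975 * sem
ci <- xbar + c(-1, 1) * qnorm(0.975) * sem
round(ci, 2)
```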
Q-Q Plot - Dr. L. V. Rao,, December 7, 2020 4 / 17
We know that the normal distribution is symmetric, so that
$N_{0.025} = -N_{0.975}$,
By default you get the minimum, the maximum, and the three
quartiles - the 0.25, 0.50, and 0.75 quantiles - so named because
they correspond to a division into four parts. Similarly, we have
deciles for 0.1, 0.2, · · · , 0.9, and centiles or percentiles. The
difference between the first and third quartiles is called the
interquartile range (IQR) and is sometimes used as a robust
alternative to the standard deviation.
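These quantities can be computed with quantile() and IQR(). The sketch below uses the mpg column of the built-in mtcars data, which is our choice of example data.

```r
x <- mtcars$mpg

# Default: minimum, the three quartiles, and maximum
q <- quantile(x)
q

# Deciles: the 0.1, 0.2, ..., 0.9 quantiles
quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# Interquartile range: third quartile minus first quartile
IQR(x)
```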
[Figure: plot with sort(x) on the x-axis]
> qqnorm(x)
As the title of the plot indicates, plots of this kind are also called
Q-Q plots (quantile versus quantile). Notice that x and y are
interchanged relative to the empirical c.d.f. - the observed values
are now drawn along the y-axis. You should notice that with this
convention the distribution has heavy tails if the outer parts of the
curve are steeper than the middle part.
[Figure: normal Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
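A Q-Q plot of this kind can be reproduced along the following lines. The data are simulated here, since the x used in the notes is not available, and the seed and sample size are arbitrary.

```r
# Simulated sample from a standard normal distribution (assumed data)
set.seed(1)
x <- rnorm(50)

# Observed values on the y-axis against theoretical normal quantiles
qq <- qqnorm(x)

# Reference line through the first and third quartiles
qqline(x, col = 2)
```

qqnorm() also returns the plotted coordinates invisibly, which is why the result can be stored in qq.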
L. V. Rao
October 2, 2019
- ,, October 2, 2019
Linear Regression Analysis
Objective:
Regression Analysis uses correlation as a basis to predict the
value of one variable from the value of a second variable or a
combination of several variables.
Terminology:
The variable whose value is to be predicted is called the
response variable (or dependent variable, criterion variable
or outcome variable) and is usually denoted by Y.
The variable that is used to predict the value of the response variable
is called the predictor variable (or independent variable) and
is denoted by x.
Linear Regression analysis provides information about the
strength of the relationship between the response
variable and the predictor variable.
Assumptions
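The equation for the simple linear regression model is given by
$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$
and the least-squares estimates of the parameters are
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r_{xy}\,\frac{s_y}{s_x}$$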
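The estimates can be checked numerically. The sketch below uses wt and mpg from the built-in mtcars data (our choice of example) and compares the hand-computed values with those returned by lm().

```r
x <- mtcars$wt    # predictor
y <- mtcars$mpg   # response

# Slope: sum of cross-products over sum of squares of x
beta_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Equivalent form: correlation times the ratio of standard deviations
beta_alt <- cor(x, y) * sd(y) / sd(x)

# Intercept: the fitted line passes through (mean(x), mean(y))
alpha_hat <- mean(y) - beta_hat * mean(x)

c(alpha_hat, beta_hat)
coef(lm(mpg ~ wt, data = mtcars))
```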
Assessing the Prediction
Coefficient of Determination, $R^2$
It measures how much of the variation in the response variable is
explained by the predictor variable.
Sum of Squares:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$R^2 = 1 - \frac{SSR}{SST}$$
$0 \le R^2 \le 1$; the closer $R^2$ is to 1, the better the
prediction.
If SSR is the same as SST, then the prediction
using the regression equation is no different from prediction
using the mean of the response variable. That is, $R^2 = 0$.
If SSR is smaller than SST, then $R^2$ will be greater than zero.
That is, prediction due to regression is better than prediction
from the mean of the response variable. The closer $R^2$ is to
1, the better the prediction due to regression.
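The definition can be verified directly. This sketch fits mpg on wt in the built-in mtcars data (our choice) and compares the hand-computed value with the one reported by summary().

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

# Total sum of squares: variation of y about its mean
SST <- sum((y - mean(y))^2)

# Residual sum of squares: variation of y about the fitted values
SSR <- sum((y - fitted(fit))^2)

R2 <- 1 - SSR / SST
R2
summary(fit)$r.squared
```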
> str( mtcars )
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
>
It is reasonable to assume that the mileage per gallon decreases as
the weight of the car increases. This can be observed by drawing
a scatter plot of mileage against weight.
> plot( mtcars$wt, mtcars$mpg, pch = 20, col = "blue",
main = "mtcars data")
> abline( lm( mpg ~ wt, data = mtcars ), col = 2 )
lm() function
> slr.lm <- lm(mpg ~ wt, data = mtcars)
> summary(slr.lm)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The five goodness-of-fit characterizations included in this
summary are the
residual standard error,
the multiple R-squared,
the adjusted R-squared,
the F-statistic and
the p-value associated with the F-statistic
Small p-values provide supporting evidence that the model
parameter is significant, in the sense that omitting it (which is
equivalent to setting it to zero) would result in a poorer
model.
The most important point is that the p-values associated with
the individual coefficients are telling us something about the
utility of each term in the model, while the p-value given in
the last line of the summary is telling us about the overall fit
quality of the model.
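The five goodness-of-fit quantities listed above can be extracted from the summary object. The sketch below refits the slr.lm model so that it is self-contained; the name p_overall is our own.

```r
slr.lm <- lm(mpg ~ wt, data = mtcars)
s <- summary(slr.lm)

s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic with its numerator and denominator d.f.

# p-value associated with the F-statistic (last line of the printed summary)
f <- s$fstatistic
p_overall <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_overall

# p-values of the individual coefficients
coef(s)[, "Pr(>|t|)"]
```

With a single predictor, the overall p-value coincides with the p-value of the slope, because F is then the square of the t statistic.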
lm() function
α̂ = 37.2851 and SE (α̂) = 1.8776
ŷ = 37.2851 − 5.3445x
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code    p-value
'***'   0 < p < 0.001
'**'    0.001 < p < 0.01
'*'     0.01 < p < 0.05
'.'     0.05 < p < 0.1
' '     0.1 < p < 1
> qqnorm( slr.lm$residuals,
main = "Normal Q-Q Plot for Residuals" )
> qqline( slr.lm$residuals, col=2 )
> mlr.lm <- lm( mpg ~ wt + cyl, data = mtcars)
> summary( mlr.lm )
Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2893 -1.5512 -0.4684 1.5743 6.1004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
wt -3.1910 0.7569 -4.216 0.000222 ***
cyl -1.5078 0.4147 -3.636 0.001064 **
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> qqnorm( mlr.lm$residuals,
main = "Normal Q-Q Plot for Residuals" )
> qqline( mlr.lm$residuals, col=2 )
Regression
Examples
Linear Regression
1. People often predict children’s future height by using their 2-year-old height. A common
rule is to double the height. The following Table contains data for eight people’s heights
as 2-year-olds and as adults.
Age 2 (in.) 39 30 32 34 35 36 36 30
Adult (in.) 71 63 63 67 68 68 70 64
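The table can be entered into R as two vectors and plotted as sketched below. The names age2 and adult match those used in the later commands; the plotting options are our choice.

```r
# Heights (in inches) from the table above
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)

# Scatter diagram of adult height against height at age 2
plot(age2, adult, pch = 20,
     xlab = "Height at age 2", ylab = "Adult Height",
     main = "Scatter diagram")
```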
Solution:
[Figure: Scatter diagram; x-axis: Height at age 2, y-axis: Adult Height]
December 23, 2020
Examples
The scatter diagram shows that the heights at age 2 and adult heights are
linearly related.
(b) Compute Pearson’s correlation coefficient between heights at age 2 and adult heights.
> r <- cor(age2, adult)
> r
[1] 0.9456109
>
The relationship between heights at age 2 and adult heights is positive
and the linear relationship is strong: r = 0.9456109, which is close to 1.
(c) Use the above data set to build a simple linear regression model for adult height
using height at 2-years as the predictor.
> out.lm <- lm(adult ~ age2)
> out.lm
Call:
lm(formula = adult ~ age2)
Coefficients:
(Intercept) age2
35.1786 0.9286
>
Therefore, the regression equation of adult on age2 is given by
adult = 35.1786 + 0.9286 × age2
[Figure: Scatter diagram; x-axis: Height at age 2, y-axis: Adult Height]
(d) Interpret the estimate of regression coefficient and examine its statistical
significance.
The regression coefficient of adult height on height at age 2 is 0.9285714286.
The rate of change in adult height for a unit change in height at age 2 will
be in the interval (0.1123691, 1.7447738) (see the confidence interval for β below).
To assess the null hypothesis $H_0: \beta = 0$, which is interpreted as no linear
relationship between the response variable and the explanatory variable, the test statistic is
$t = b / SE_b$ and the corresponding p-value is obtained as follows:
if $H_a: \beta \neq 0$, $p_{obs} = 2 \times P(T \geq |t|)$,
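The test statistic and two-sided p-value can be computed as in the following sketch. The data are re-entered so that the block runs on its own, and the names ct, b, se_b, t_stat and p_obs are our own.

```r
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
out.lm <- lm(adult ~ age2)

# Coefficient table: estimate, standard error, t value, p-value
ct <- coef(summary(out.lm))
b    <- ct["age2", "Estimate"]
se_b <- ct["age2", "Std. Error"]

# Test statistic t = b / SE_b on n - 2 degrees of freedom
t_stat <- b / se_b

# Two-sided p-value: 2 * P(T >= |t|)
p_obs <- 2 * pt(abs(t_stat), df = df.residual(out.lm), lower.tail = FALSE)
c(t = t_stat, p = p_obs)

# Confidence interval for the slope
confint(out.lm, "age2", level = 0.95)
```

The values of t_stat and p_obs should agree with the "t value" and "Pr(>|t|)" columns printed by summary(out.lm).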
(g) Create simple diagnostic plots for your model and identify possible outliers.
[Figure: diagnostic plots for the fitted model, with panels showing residuals, standardized residuals and Cook's distance; observations 1, 3, 7 and 8 are labelled]
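Diagnostic plots of this kind can be produced with the plot() method for lm objects. The sketch below is one way to do it, with the data re-entered so that it runs on its own; the 2 x 2 layout and the name cd are our choices.

```r
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
out.lm <- lm(adult ~ age2)

# Four default diagnostic plots on one page
op <- par(mfrow = c(2, 2))
plot(out.lm)
par(op)   # restore the previous graphics settings

# Cook's distance for each observation; large values flag influential points
cd <- cooks.distance(out.lm)
cd
```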
(h) Using the data, what is the predicted adult height for a 2-year-old who is 33 inches
tall?
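The prediction can be obtained with predict(), as sketched below. The data are re-entered so that the block is self-contained, and the name pred is our own.

```r
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
out.lm <- lm(adult ~ age2)

# Predicted adult height for a 2-year-old who is 33 inches tall
pred <- predict(out.lm, newdata = data.frame(age2 = 33))
pred
```

This agrees with substituting 33 into the fitted equation: 35.1786 + 0.9286 × 33 is approximately 65.82 inches.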