R Software Project
R SOFTWARE
SUBMITTED BY: ASHMITA NYAUPANE
SUBMITTED TO: Dr. Laxmi
INDEX
CONTENT
OBJECTIVE
INTRODUCTION TO R SOFTWARE
EXPERIMENT 1
EXPERIMENT 2
EXPERIMENT 3
EXPERIMENT 4
EXPERIMENT 5
EXPERIMENT 6
INTRODUCTION TO R PROGRAM
R operates in two modes: interactive and batch. The one typically used is interactive mode. In this mode,
you type in commands, R displays results, you type in more commands, and so on. On the other hand,
batch mode does not require interaction with the user. It’s useful for production jobs, such as when a
program must be run periodically, say once per day, because you can automate the process.
You can then execute R commands. The window in which all this appears is called the R console. As a
quick example, consider a standard normal distribution—that is, with mean 0 and variance 1. If a random
variable X has that distribution,
then its values are centered around 0, some negative, some positive, averaging in the end to 0. Now form
a new random variable Y = |X|. Since we’ve taken the absolute value, the values of Y will not be centered
around 0, and the mean of Y will be positive. Let’s find the mean of Y. Our approach is based on a simulated
example of N(0,1) variates.
> mean(abs(rnorm(100)))
[1] 0.7194236
This code generates the 100 random variates, finds their absolute values, and then finds the mean of the
absolute values.
The [1] you see means that the first item in this line of output is item 1. In this case, our output consists of
only one line (and one item), so this is redundant. This notation becomes helpful when you need to read
voluminous output that consists of a lot of items spread over many lines. For example, if there were two
rows of output with six items per row, the second row would be labeled [7].
> rnorm(10)
Here, there are 10 values in the output, and the label [7] in the second row lets you quickly see that
0.6877001, for instance, is the eighth output item.
You can also store R commands in a file. By convention,
R code files have the suffix .R or .r. If you create a code file called z.R, you can execute the contents of that
file by issuing the following command:
> source("z.R")
1.2 BATCH MODE
Sometimes it’s convenient to automate R sessions. For example, you may wish to run an R
script that generates a graph without needing to bother with manually launching R and
executing the script yourself. Here you would run R in batch mode.
As an example, let’s put our graph-making code into a file named z.R with the following contents:
The items marked with # are comments. They’re ignored by the R interpreter. Comments
serve as notes to remind us and others what the code is doing, in a human-readable
format.
Here’s a step-by-step breakdown of what we’re doing in the preceding code:
• We call the pdf() function to inform R that we want the graph we create to be saved in the PDF file
xh.pdf.
• We call rnorm() (for random normal) to generate 100 N(0,1) random variates.
• We call hist() on those variates to draw a histogram of these values.
• We call dev.off() to close the graphical “device” we are using, which is
the file xh.pdf in this case. This is the mechanism that actually causes the file to be written to
disk.
We could run this code automatically, without entering R’s interactive mode, by invoking R
with an operating-system shell command (such as at the $ prompt commonly used in Linux
systems):
You can confirm that this worked by using your PDF viewer to display the saved histogram.
(It will just be a plain-vanilla histogram, but R is capable of producing quite sophisticated
variations.)
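The listing for z.R does not appear above. A minimal sketch consistent with the step-by-step breakdown might look like the following; the comments and the intermediate variable x are assumptions, not the original file.

```r
# z.R: a sketch reconstructed from the steps described above
pdf("xh.pdf")      # send graphical output to the file xh.pdf
x <- rnorm(100)    # generate 100 N(0,1) random variates
hist(x)            # draw a histogram of those variates
dev.off()          # close the graphics device, which writes xh.pdf to disk
```

Assuming R is on the shell's search path, a file like this can be run without entering interactive mode with, for example, R CMD BATCH z.R or Rscript z.R.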
EXPERIMENT 1
Let’s make a simple data set (in R parlance, a vector) consisting of the numbers 1, 2, and 4, and
name it x:
> x <- c(1,2,4)
The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does
not work in some special situations. Note that there are no fixed types associated with variables.
Here, we’ve assigned a vector to x, but later we might assign something of a different type to
it. We’ll look at vectors and the other types in Section 1.4.
The c stands for concatenate. Here, we are concatenating the numbers 1, 2, and 4. More precisely,
we are concatenating three one-element vectors that consist of those numbers. This is because
any number is also considered to be a one-element vector. Now we can also do the following:
> q <- c(x,x,8)
which sets q to (1,2,4,1,2,4,8) (yes, including the duplicates).
Now let’s confirm that the data is really in x. To print the vector to the screen, simply type its name.
If you type any variable name (or, more generally, any expression) while in interactive mode, R will
print out the value of that variable (or expression). Programmers familiar with other languages such
as Python will find this feature familiar. For our example, enter this:
> x
[1] 1 2 4
Yep, sure enough, x consists of the numbers 1, 2, and 4. Individual elements of a vector are
accessed via [ ]. Here’s how we can print out the third element of x:
> x[3]
[1] 4
As in other languages, the selector (here, 3) is called the index or subscript. Those familiar with
ALGOL-family languages, such as C and C++, should note that elements of R vectors are
indexed starting from 1, not 0.
Subsetting is a very important operation on vectors. Here’s an
example:
> x <- c(1,2,4)
> x[2:3]
[1] 2 4
The expression x[2:3] refers to the sub-vector of x consisting of elements 2 through 3, which
are 2 and 4 here.
We can easily find the mean and standard deviation of our data set, as follows:
> mean(x)
[1] 2.333333
> sd(x)
[1] 1.527525
This again demonstrates typing an expression at the prompt in order to print it. In the first line,
our expression is the function call mean(x). The return value from that call is printed
automatically, without requiring a call to R’s print() function.
If we want to save the computed mean in a variable instead of just printing it to the screen, we
could execute this code:
> y <- mean(x)
> y # print out y
[1] 2.333333
Comments are especially valuable for documenting program code, but they are useful in
interactive sessions, too, since R records the command history (as discussed in Section 1.6). If
you save your session and resume it later, the comments can help you remember what you were
doing.
Finally, let’s do something with one of R’s internal data sets (these are
used for demos). You can get a list of these data sets by typing the following:
> data()
One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find
the mean and standard deviation of this data set:
> mean(Nile)
[1] 919.35
> sd(Nile)
[1] 169.2275
> hist(Nile)
A window pops up with the histogram in it, as shown in Figure 1-1. This graph is bare-bones simple, but R
has all kinds of optional bells and whistles for plotting. For instance, you can change the number of bins by
specifying the breaks argument. The call hist(z,breaks=12) would draw a histogram of the data set z with about 12
bins. You can also create nicer labels, make use of color, and make many other changes to create a more
informative and eye-appealing graph. When you become more familiar with R, you’ll be able to construct
complex, rich color graphics of striking beauty.
Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function
(or alternatively by pressing CTRL-D in Linux or CMD-D on a Mac):
> q()
Save workspace image? [y/n/c]: n
That last prompt asks whether you want to save your variables so that you can resume work later.
If you answer y, then all those objects will be loaded automatically the next time you run R. This
is a very important feature, especially when working with large or numerous data sets. Answering
y here also saves the session’s command history. We’ll talk more about saving your workspace.
As a simple introduction, let’s define a function named oddcount(), whose purpose is to count the odd
numbers in a vector of integers. Normally, we would compose the function code using a text
editor and save it in a file, but in this quick-and-dirty example, we’ll enter it line by line in R’s
interactive mode. We’ll then call the function on a couple of test cases.
First, we told R that we wanted to define a function named oddcount with one argument, x.
The left brace demarcates the start of the body of the function. We wrote one R statement per
line.
Until the body of the function is finished, R reminds you that you’re still in the definition by using
+ as its prompt, instead of the usual >. (Actually,+ is a line-continuation character, not a prompt
for a new input.) R resumes the > prompt after you finally enter a right brace to conclude the
function body.
After defining the function, we evaluated two calls to oddcount(). Since there are three odd numbers in
the vector (1,3,5), the call oddcount(c(1,3,5)) returns the value 3. There are four odd numbers in
(1,2,3,7,9), so the second call returns 4.
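The function listing itself does not appear above. A sketch that matches the description (one argument x, a counter k, a loop using the %% operator) could be written as follows; the body is a reconstruction and should be read as an assumption.

```r
# Count the odd numbers in an integer vector x.
# The name oddcount comes from the text; the body is a reconstruction.
oddcount <- function(x) {
  k <- 0                           # running count of odd values
  for (n in x) {
    if (n %% 2 == 1) k <- k + 1    # %% is the modulo operator
  }
  return(k)
}

oddcount(c(1, 3, 5))        # three odd numbers
oddcount(c(1, 2, 3, 7, 9))  # four odd numbers
```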
Notice that the modulo operator for remainder arithmetic is %% in R, as indicated by the
comment. For example, 38 divided by 7 leaves a remainder of 3:
> 38 %% 7
[1] 3
For instance, let’s see what happens with the following code:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
First, it sets n to x[1], and then it tests that value for being odd or even. If the value is odd,
which is the case here, the count variable k is incremented. Then n is set to x[2], tested for being
odd or even, and so on.
By the way, C/C++ programmers might be tempted to write the preceding loop like this:
for (i in 1:length(x)) {
if (x[i] %% 2 == 1) k <- k+1
}
Here, length(x) is the number of elements in x. Suppose there are 25 elements. Then
1:length(x) means 1:25, which in turn means 1,2,3,...,25. This code would also work (unless x were
to have length 0), but one of the major themes of R programming is to avoid loops if
possible; if not, keep loops simple. Look again at our original formulation:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
It’s simpler and cleaner, as we do not need to resort to using the length() function and array
indexing.
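To illustrate the advice about avoiding loops, the same count can be obtained with a vectorized expression. The sketch below also shows seq_along(), which sidesteps the zero-length problem mentioned for 1:length(x). The variable names are illustrative.

```r
x <- c(1, 2, 3, 7, 9)

# Vectorized count of odd values: x %% 2 == 1 yields a logical vector,
# and sum() counts its TRUE entries.
k_vec <- sum(x %% 2 == 1)

# If an index loop is really needed, seq_along(x) is empty for a
# zero-length x, whereas 1:length(x) would give c(1, 0).
k_loop <- 0
for (i in seq_along(x)) {
  if (x[i] %% 2 == 1) k_loop <- k_loop + 1
}
```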
R functions will return the last value computed if there is no explicit return() call. However, this
approach must be used with care.
In programming language terminology, x is the formal argument (or formal parameter) of the
function oddcount(). In the first function call in the preceding example, c(1,3,5) is referred to as
the actual argument. These terms allude to the fact that x in the function definition is just a
placeholder, whereas c(1,3,5) is the value actually used in the computation. Similarly, in
the second function call, c(1,2,3,7,9) is the actual argument.
EXPERIMENT 2
Aim: Implementation of operations on vector and list data objects
THEORY:
With R, it’s important to understand that there is a difference between the actual
R object and the manner in which that R object is printed to the console. Often, the printed
output may have additional bells and whistles to make the output more friendly to the users.
However, these bells and whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the
vector() function. There is really only one rule about vectors in R, which is that a vector can
only contain objects of the same class. But of course, like any good rule, there is an exception,
which is a list, which we will get to a bit later. A list is represented as a vector but can contain
objects of different classes. Indeed, that’s usually why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data analysis.
Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE.
However, in general one should try to use the explicit TRUE and FALSE values when
indicating logical values. The T and F values are primarily there for when you’re feeling lazy.
> x
[1] 0 0 0 0 0 0 0 0 0 0
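The output above is what vector() gives for a numeric vector of length 10. A sketch of how vectors of each atomic class might be created is shown below; the variable names and values are illustrative assumptions.

```r
x_num  <- c(0.5, 0.6)       # numeric
x_lgl  <- c(TRUE, FALSE)    # logical; c(T, F) is the short-hand form
x_chr  <- c("a", "b", "c")  # character
x_int  <- 9:29              # integer
x_cplx <- c(1+0i, 2+4i)     # complex

# vector() creates a vector of a given mode and length;
# for mode "numeric" it is filled with zeros.
x <- vector("numeric", length = 10)
x
```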
To calculate the frequency of each value in the State vector, you can use the table() function.
If a vector contains an NA (not available) value, the mean() function returns NA.
To calculate the mean of a vector excluding NA values, include the na.rm = TRUE
argument in the call to mean().
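As a small sketch of both points, the code below uses a made-up State vector and a made-up numeric vector containing one NA; neither comes from the original data.

```r
# Hypothetical data: a State vector and a numeric vector with one NA
State <- c("Delhi", "Goa", "Delhi", "Kerala", "Goa", "Delhi")
score <- c(10, 20, NA, 40)

freq <- table(State)               # frequency of each distinct value
m_na <- mean(score)                # NA, because one value is missing
m_ok <- mean(score, na.rm = TRUE)  # mean of the non-missing values
```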
Some useful vectors can be created quickly with R. The colon operator is used to generate integer sequences:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> -3:4
[1] -3 -2 -1 0 1 2 3 4
> 9:5
[1] 9 8 7 6 5
More generally, the function seq() can generate any arithmetic progression.
> seq(from=2, to=6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
Sometimes it’s necessary to have repeated values, for which we use rep():
> rep(5,3)
[1] 5 5 5
> rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
> rep(-1:3, length.out=10)
[1] -1 0 1 2 3 -1 0 1 2 3
> 2^(0:10)
You can use subscripts to select the specific component of the list.
> x <- list(1:3, TRUE, "Hello", list(1:2, 5))
Here x has 4 elements: a numeric vector, a logical, a string and another list.
> x[[3]]
[1] "Hello"
> x[c(1,3)]
[[1]]
[1] 1 2 3
[[2]]
[1] "Hello"
We can also name some or all of the entries in our list, by supplying argument names to list():
> x <- list(y=1:3, TRUE, z="Hello")
> x
$y
[1] 1 2 3
[[2]]
[1] TRUE
$z
[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to
how we can recover the entries by their name. We can still use the numeric
position if we prefer:
> x$y
[1] 1 2 3
> x[[1]]
[1] 1 2 3
The function names() can be used to obtain a character vector of all the names of the entries in a list:
> names(x)
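The different ways of reaching list entries can be put side by side in a short sketch. The list below is assumed to be the partly named one printed above, with its second entry left unnamed.

```r
x <- list(y = 1:3, TRUE, z = "Hello")   # the second entry is left unnamed

x$y             # access by name
x[["z"]]        # equivalent name-based access with [[ ]]
x[[2]]          # unnamed entries are still reachable by position
nm <- names(x)  # character vector of names; "" marks the unnamed entry
```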
2.1 THEORY:
Matrices are much used in statistics, and so play an important role in R. To create a matrix
use the function matrix(), specifying elements by column first:
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
This is called column-major order. Of course, we need only give one of the dimensions:
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
> diag(3)
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
The last operator performs an outer product, so it creates a matrix with (i, j)-th entry x_i y_j. The
function outer() generalizes this to any function f on two arguments, to create a matrix with
entries f(x_i, y_j). (More on functions later.)
[1,] 2 3 4 5
[2,] 3 4 5 6
[3,] 4 5 6 7
[,1]
[1,] 30
[2,] 36
[3,] 45
[1,] 1 4 7
[2,] 4 10 16
[3,] 9 18 30
> t(A) # transpose
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 10
> det(A) # determinant
[1] -3
> diag(A) # diagonal
[1] 1 5 10
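Several of the commands that produced the matrix output in this section are not shown. The sketch below gives calls that are consistent with the printed results. The matrix A is an assumption chosen to agree with the t(A) and det(A) output, and the other object names are illustrative.

```r
# A is chosen to agree with the t(A) and det(A) output shown above.
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 10), nrow = 3)  # filled column by column
M <- matrix(1:12, nrow = 3, ncol = 4)  # column-major: first column is 1, 2, 3
R <- matrix(1:3, nrow = 3, ncol = 4)   # 1:3 is recycled to fill all 4 columns

O <- 1:5 %o% 1:5           # outer product: entry (i, j) is i * j
S <- outer(1:3, 1:4, "+")  # outer() with another function: entry (i, j) is i + j

Ax <- A %*% c(1, 2, 3)     # matrix-vector product
E  <- A * 1:3              # element-wise product; 1:3 is recycled down each column
tA <- t(A)                 # transpose
dA <- det(A)               # determinant
```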
2.2 ARRAY:
Of course, if we have a data set consisting of more than two pieces of categorical information
about each subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array()
command. Here is a 2 × 3 × 3 array:
> arr
,,1
[1,] 1 3 5
[2,] 2 4 6
,,2
[1,] 7 9 11
[2,] 8 10 12
,,3
[1,] 13 15 17
[2,] 14 16 18
Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3 matrix.
Note that we no longer specify the number of rows and columns separately, but use a single
vector dim whose length is the number of dimensions. You can recover this vector with the
dim() function.
> dim(arr)
[1] 2 3 3
Arrays can be subsetted and modified in exactly the same way as a matrix, only using the
appropriate number of co-ordinates:
> arr[1,2,3]
[1] 15
> arr[,2,]
[,1] [,2] [,3]
[1,] 3 9 15
[2,] 4 10 16
> arr[,,1,drop=FALSE]
,,1
[1,] 0 3 5
[2,] 2 4 6
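The call that creates arr is not shown above. A sketch that reproduces the 2 × 3 × 3 array as first printed, assuming it holds the integers 1 to 18, might be:

```r
arr <- array(1:18, dim = c(2, 3, 3))  # values fill the first index fastest

dim(arr)                        # the dimension vector
arr[1, 2, 3]                    # one entry: row 1, column 2, slice 3
arr[, 2, ]                      # column 2 of every slice, as a 2 x 3 matrix
s1 <- arr[, , 1, drop = FALSE]  # drop = FALSE keeps all three dimensions
```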
Factors
R has a special data structure to store categorical variables. It tells R that a variable is
nominal or ordinal by making it a factor.
data$x = as.factor(data$x)
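As a short sketch of nominal versus ordinal factors, the code below uses a made-up data frame and made-up category labels; only as.factor() on data$x comes from the text.

```r
# Hypothetical data frame with a column stored as character strings
data <- data.frame(x = c("low", "high", "medium", "low"),
                   stringsAsFactors = FALSE)

data$x <- as.factor(data$x)   # nominal: levels are sorted alphabetically
levels(data$x)

# For an ordinal variable, factor() can be given the level order explicitly
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
size[1] < size[2]             # comparisons respect the level order
```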
EXPERIMENT 4
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R
2.3 THEORY:
This experiment covers how to perform the most frequently used data manipulation tasks with R. It includes
various examples with datasets and code, and gives a quick look at several functions used
in R.
# for multiple
# OR
> DF[keeps]
> DF
name=c('a','b','c','d','e','e'),
marks=c(44,55,22,33,66,77))
> d3
d3[order(d3$roll),]
# OR
d3[with(d3,order(roll)),]
2.6 SUBSETS:
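roll = c(1:5)
names = c(letters[1:5])
marks = c(12,33,44,55,66)
d4 = data.frame(roll,names,marks)
# keep rows with marks above 33 and roll above 4
sub1 = subset(d4,marks>33 & roll>4)
sub1
# same rows, keeping only the roll and names columns
sub1 = subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1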
d$class=c(1,2,1,2,1,2)
table(d$class)
In this example, we are replacing "Delhi" with "Mumbai" in State variable. We need to
convert the variable from factor to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
2.13 SORTING
Sorting is one of the most common data manipulation tasks. It is generally used when
we want to see the top 5 highest / lowest values of a variable.
Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.
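The sorting code is not shown above. A sketch of the idiom the note describes, using order() on a made-up data frame with a SAT column, might be:

```r
# Hypothetical data frame with a SAT column
mydata <- data.frame(Name = c("a", "b", "c", "d"),
                     SAT  = c(1200, 1450, 980, 1320))

asc  <- mydata[order(mydata$SAT), ]    # ascending order (the default)
desc <- mydata[order(-mydata$SAT), ]   # "-" reverses the order: descending
top2 <- head(desc, 2)                  # the two highest SAT values
```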
2.18 CONCLUSION:
2.19 DATA2
result = intersect(data1$roll, data2$roll)
result
result = merge(data1, data2, all=FALSE)
result
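The data frames used in these calls are not defined above. The sketch below uses two made-up data frames that share a roll column to show what intersect() and merge() return.

```r
# Hypothetical data frames sharing a "roll" column
data1 <- data.frame(roll = 1:4, marks = c(44, 55, 22, 33))
data2 <- data.frame(roll = c(2, 4, 6), class = c(1, 2, 1))

common <- intersect(data1$roll, data2$roll)  # roll numbers present in both
inner  <- merge(data1, data2, all = FALSE)   # inner join on the shared column
full   <- merge(data1, data2, all = TRUE)    # keeps unmatched rows, filled with NA
```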
EXPERIMENT 5
Aim: Study and implementation of various control structures in R
2.20 THEORY:
A loop helps you repeat a similar operation on different variables, columns or
datasets. For example, suppose you want to multiply each variable by 5. Instead of multiplying
each variable one by one, you can perform this task in a loop. Its main benefit is to reduce
duplication in your code, which makes later changes easier.
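The multiply-by-5 example can be sketched as a for loop over column names. The data frame and its column names are made up for illustration.

```r
# Hypothetical data frame with three numeric columns
df <- data.frame(a = 1:3, b = c(10, 20, 30), c = c(0.5, 1.5, 2.5))

# Loop over the column names and multiply each column by 5
for (col in names(df)) {
  df[[col]] <- df[[col]] * 5
}

# Arithmetic is vectorized over data frames, so the same result
# can also be had without a loop
df2 <- data.frame(a = 1:3, b = c(10, 20, 30), c = c(0.5, 1.5, 2.5)) * 5
```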
If-Else statements are an important part of R programming. In this tutorial, we will see various
ways to apply conditional statements (If..Else, nested IF) in R. In R, there are a lot of powerful
packages for data manipulation. In the later part of this tutorial, we will see how IF ELSE
statements are used in popular packages.
Let's create sample data to show how to use the ifelse() function. This data frame will be
used in further examples.
x1 x2 x3
1 129 A
3 178 B
5 140 C
7 186 D
9 191 E
11 104 F
13 150 G
15 183 H
17 151 I
19 142 J
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains alternate numbers from 1 to 20. In total,
these are 10 numeric values.
x2 = sample(100:200,10,FALSE) : The variable 'x2' contains 10 non-repeating random
numbers between 100 and 200.
x3 = LETTERS[1:10] : The variable 'x3' contains the 10 letters from 'A' to 'J'.
The ifelse() function in R works similarly to the IF function in MS Excel. Its general form is
ifelse(condition, value if true, value if false).
Suppose you are asked to create a binary variable - 1 or 0 based on the variable 'x2'. If the value of
variable 'x2' is greater than 150, assign 1 else 0.
mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 on the same data frame 'mydata'. The output is shown in the
image below -
If variable 'x3' contains character values - 'A', 'D', the variable 'x1' should be multiplied by 2.
Otherwise it should be multiplied by 3.
x1 x2 x3 y
1 129 A 2
3 178 B 9
5 140 C 15
7 186 D 14
9 191 E 27
11 104 F 33
13 150 G 39
15 183 H 45
17 151 I 51
19 142 J 57
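The two rules above can be sketched with ifelse() as follows. The x2 values are copied from the printed table instead of being drawn with sample(), so the result does not depend on the random number generator. The expression for y is an assumption, since the original code for it is not shown.

```r
mydata <- data.frame(x1 = seq(1, 20, by = 2),
                     x2 = c(129, 178, 140, 186, 191, 104, 150, 183, 151, 142),
                     x3 = LETTERS[1:10])

# ifelse(test, yes, no): element-wise choice between two values
mydata$x4 <- ifelse(mydata$x2 > 150, 1, 0)

# x1 is doubled where x3 is 'A' or 'D', and tripled otherwise
mydata$y <- ifelse(mydata$x3 %in% c("A", "D"), mydata$x1 * 2, mydata$x1 * 3)
```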
EXPERIMENT 6
Aim: Data Manipulation with dplyr package
2.29 THEORY:
The dplyr package is one of the most powerful and popular packages in R. It was
written by Hadley Wickham, a well-known R programmer who has written many useful R
packages such as ggplot2 and tidyr. This experiment includes several examples and tips on how to use
the dplyr package for cleaning and transforming data. It serves as a tutorial on data manipulation
and data wrangling with R.
People have been using SQL to analyze data for decades. Most modern data analysis
software, such as Python, R and SAS, supports SQL commands. But SQL was not designed to
perform data analysis; it was designed for querying and managing data. There are many
data analysis operations where SQL fails or makes simple things difficult, for example
calculating the median for multiple variables or converting wide format data to long format.
The dplyr package, by contrast, was designed for data analysis.
The names of dplyr functions are similar to SQL commands, such as select() for
selecting variables, group_by() for grouping data by a grouping variable, and the join functions,
including inner_join() and left_join(), for joining two data sets. It also supports the kind of
sub-queries for which SQL is popular.
2.34 HOW TO INSTALL AND LOAD DPLYR PACKAGE
install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
In this tutorial, we are using the following data, which contains income generated by states from
year 2002 to 2015. Note : This data does not contain actual income figures of the states.
This dataset contains 51 observations (rows) and 16 variables (columns). A snapshot of the first 6
rows of the dataset is shown below.
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
The sample_n function selects random rows from a data frame (or table). The second parameter
of the function tells R the number of rows to select.
sample_n(mydata,3)
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
33 N New York 1395149 1611371 1170675 1446810 1426941 1463171 1732098 1426216
The sample_frac function returns a random N% of the rows. In the example below, it returns
a random 10% of the rows.
sample_frac(mydata,0.1)
x1 = distinct(mydata)
In this dataset, there is not a single duplicate row, so it returned the same number of rows as in
mydata.
The .keep_all argument is used to retain all other variables in the output data frame.
In the example below, we are using two variables - Index, Y2010 to determine uniqueness.
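The code for this example is not shown. The sketch below shows the three forms of distinct() on a made-up miniature of the income data; the values are invented for illustration.

```r
library(dplyr)

# Hypothetical miniature of the income data
df <- data.frame(Index = c("A", "A", "C", "A"),
                 Y2010 = c(100, 100, 300, 200),
                 Y2011 = c(1, 2, 3, 4))

d_all  <- distinct(df)                                  # whole-row duplicates only
d_cols <- distinct(df, Index, Y2010)                    # only the two key columns kept
d_keep <- distinct(df, Index, Y2010, .keep_all = TRUE)  # first row per key, all columns
```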
SELECT( ) FUNCTION
Suppose you are asked to select only a few variables. The code below selects the variable "Index" and
the columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
Helpers        Description
starts_with()  Starts with a prefix
ends_with()    Ends with a suffix
contains()     Contains a literal string
matches()      Matches a regular expression
num_range()    Numerical range like x01, x02, x03.
one_of()       Variables in character vector.
everything()   All variables.
EXAMPLE 9 : SELECTING VARIABLES CONTAINING 'I' IN THEIR NAMES
mydata4 = select(mydata, contains("I"))
The code below keeps variable 'State' in the front and the remaining variables follow that.
[1] "State" "Index" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008" "Y2009"
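Several of the select() calls described in this section are not shown. The sketch below illustrates a few of the helpers on a made-up one-row data frame; the object names s1 to s5 are illustrative.

```r
library(dplyr)

# Hypothetical one-row sample with the same kind of column names
df <- data.frame(Index = "A", State = "Alabama",
                 Y2002 = 1, Y2003 = 2, Y2004 = 3)

s1 <- select(df, Index, State:Y2003)   # a name plus a range of adjacent columns
s2 <- select(df, starts_with("Y"))     # columns whose names start with "Y"
s3 <- select(df, -Index)               # drop a column
s4 <- select(df, contains("I"))        # names containing "I" (case is ignored by default)
s5 <- select(df, State, everything())  # State first, then the remaining columns
```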
RENAME( ) FUNCTION
FILTER( ) FUNCTION
Suppose you need to subset data. You want to filter rows and retain only those rows in which
Index is equal to A.
mydata7 = filter(mydata, Index == "A")
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
The %in% operator can be used to select multiple items. In the following program, we are
telling R to select rows with 'A' or 'C' in column 'Index'.
Suppose you need to apply an 'AND' condition. In this case, we are picking data for 'A' and 'C' in
the column 'Index' and income of at least 1.3 million in Year 2002.
mydata8 = filter(mydata6, Index %in% c("A", "C") & Y2002 >= 1300000 )
The '|' denotes OR in the logical condition. It means either of the two conditions.
mydata9 = filter(mydata6, Index %in% c("A", "C") | Y2002 >= 1300000)
The grepl function is used to search for pattern matches. In the following code, we are
looking for records in which the column State contains 'Ar'.
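The filter() variants described in this section can be compared on a small made-up data frame. The income figures below are invented, and the object names f1 to f5 are illustrative.

```r
library(dplyr)

# Hypothetical rows; the Y2002 figures are made up
df <- data.frame(Index = c("A", "A", "C", "N"),
                 State = c("Alabama", "Arkansas", "Colorado", "Nevada"),
                 Y2002 = c(1296530, 1119299, 1343824, 1426117))

f1 <- filter(df, Index == "A")                               # single value
f2 <- filter(df, Index %in% c("A", "C"))                     # several values
f3 <- filter(df, Index %in% c("A", "C") & Y2002 >= 1300000)  # AND condition
f4 <- filter(df, Index %in% c("A", "C") | Y2002 >= 1300000)  # OR condition
f5 <- filter(df, grepl("Ar", State))                         # pattern match on State
```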
SUMMARISE( ) FUNCTION
In the example below, we are calculating mean and median for the variable Y2015.
In the following example, we are calculating number of records, mean and median for
variables Y2005 and Y2006. The summarise_at function allows us to select multiple
variables by their names.
Suppose you want to subtract mean from its original value and then calculate variance of it.
set.seed(222)
mydata <- data.frame(X1=sample(1:100,100), X2=runif(100))
summarise_at(mydata,vars(X1,X2), function(x) var(x - mean(x)))
X1 X2
ALTERNATIVE METHOD :
Second, the summarise_all function calculates summary statistics for all the columns in a
dataframe
summarise_all(numdata, funs(n(),mean,median))
EXAMPLE 22 : SUMMARIZE FACTOR VARIABLE
nlevels nmiss
1 19 0
ARRANGE( ) FUNCTION
Sorting by Multiple Variables
The default sorting order of the arrange() function is ascending. In this example, we are
sorting data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another variable in ascending
order.
arrange(mydata, desc(Index), Y2011)
It is important to understand the pipe (%>%) operator before learning the other
functions of the dplyr package. dplyr uses the pipe operator from another package
(magrittr).
It allows you to write sub-queries as you would in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. The
question arises: why use the pipe operator %>%? The answer is that it lets you chain
multiple functions together.
SYNTAX :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any function.
EXAMPLE :
The code below demonstrates the usage of the pipe %>% operator. In this example, we are
selecting 10 random observations of two variables "Index" "State" from the data frame
"mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
SYNTAX :
group_by(data, variables)
or
data %>% group_by(variables)
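To show the pipe and group_by() working together, the sketch below groups a made-up data frame by Index, summarises each group, and sorts the result. The column values and the summary column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data with a grouping column and two numeric columns
df <- data.frame(Index = c("A", "A", "C", "C", "C"),
                 Y2011 = c(10, 20, 30, 40, 50),
                 Y2012 = c(1, 3, 5, 7, 9))

by_index <- df %>%
  group_by(Index) %>%                        # one group per Index value
  summarise(n = n(),                         # rows in each group
            mean_2011 = mean(Y2011),
            median_2012 = median(Y2012)) %>%
  arrange(desc(mean_2011))                   # largest group mean first
```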
OBJECTIVE
The objective of this lab is to understand various aspects of research.
theory and data, and communicating their results and analysis in writing