UNIT 1 R Handouts-UN
1.1 Overview of R
R is an interpreted computer programming language which was
created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand. The R Development Core Team currently
develops R. It is also a software environment used for statistical
analysis, graphical representation, reporting, and data modeling. R is
an implementation of the S programming language combined
with lexical scoping semantics.
Features of R Programming
R is a domain-specific programming language designed for data analysis.
It has some unique features which make it very powerful, the most important
arguably being the notion of vectors. Vectors allow us to perform a
complex operation on a set of values in a single command. R provides the
following basic data structures:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
In R programming, the most basic data objects are vectors; the other
structures listed above are built from them. Please note that in R the number
of classes is not confined to only the above six types. For example, we can
combine atomic vectors into an array, whose class will be array.
Vectors:
When you want to create a vector with more than one element, you
should use the c() function, which combines the elements into a vector.
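As a minimal sketch of c(), the snippet below builds two small vectors; the names fruits and scores are made up for illustration.

```r
# Combine several values into one vector with c()
fruits <- c("apple", "banana", "cherry")
scores <- c(90, 85.5, 70)

print(fruits)
print(class(scores))    # "numeric"
print(length(fruits))   # 3
```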
Lists:
A list is an R-object which can contain many different types of elements inside
it like vectors, functions and even another list inside it.
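A short illustration, with arbitrary contents, of a list holding a vector, a function, and another list:

```r
# A list may mix element types freely
my_list <- list(c(1, 2, 3), sin, list("nested", TRUE))

print(length(my_list))        # 3
print(class(my_list[[2]]))    # "function"
print(my_list[[3]][[1]])      # "nested"
```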
Matrices:
A matrix is a two-dimensional rectangular data set. It can be created using a
vector input to the matrix function.
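For example, the following sketch fills a 2 x 3 matrix row by row from a vector (the values are arbitrary):

```r
# matrix() reshapes a vector into rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(m)
print(dim(m))    # 2 3
print(m[2, 1])   # 4
```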
Array:
While matrices are confined to two dimensions, arrays can be of any number of
dimensions. The array() function takes a dim argument which sets the required
number of dimensions. For example, dim = c(3, 3, 2) creates an array with two
elements which are 3x3 matrices each.
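A minimal sketch of such an array, filled with the numbers 1 to 18 purely for illustration:

```r
# Two 3x3 matrices stacked along a third dimension
a <- array(1:18, dim = c(3, 3, 2))

print(dim(a))      # 3 3 2
print(a[, , 2])    # the second 3x3 matrix, holding 10..18
```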
Factors:
Factors are R objects which are created using a vector. A factor stores
the vector along with the distinct values of its elements as labels.
The labels are always character strings, irrespective of whether the input
vector is numeric, character, or Boolean. Factors are useful in statistical
modeling.
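The sketch below, using an invented vector of colours, shows the labels (levels) and the integer codes a factor stores:

```r
colours <- c("green", "red", "green", "blue", "red")
f <- factor(colours)

print(levels(f))       # "blue" "green" "red"
print(as.integer(f))   # 2 3 2 1 3
# Labels are character strings even for numeric input
print(levels(factor(c(10, 20, 10))))   # "10" "20"
```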
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame
each column can contain a different mode of data. The first column can be
numeric while the second column can be character and the third column can be
logical. It is a list of vectors of equal length.
Data frames are created using the data.frame() function.
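A small example with made-up columns; stringsAsFactors = FALSE is passed explicitly so that the character column behaves the same in every R version:

```r
df <- data.frame(
  id = c(1, 2, 3),
  name = c("Ann", "Bob", "Cy"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

print(df)
print(sapply(df, class))   # numeric, character, logical
```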
1.3 Reading and Writing Data
Functions for Reading Data into R
There are a few very useful functions for reading data into R.
1. read.table() and read.csv() are two popular functions used for reading
tabular data into R.
2. readLines() is used for reading lines from a text file.
3. source() is a very useful function for reading in R code files from
another R program.
4. dget() function is also used for reading in R code files (R objects that
have been written out as text with dput()).
5. load() function is used for reading in saved workspaces.
6. unserialize() function is used for reading single R objects in binary
format.
Writing data files with write.table()
The following are a few important arguments commonly used with write.table():
1. x, the object to be written, typically a data frame
2. file, the name of the file which the data are to be written to
3. sep, the field separator string
4. col.names, a logical value indicating whether the column names of x are
to be written along with x, or a character vector of column names to be
written
5. row.names, a logical value indicating whether the row names of x are to
be written along with x, or a character vector of row names to be written
6. na, the string to use for missing values in the data.
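The arguments above can be tried with a quick round trip. This is only a sketch: the data frame is invented and the file is written to a temporary path.

```r
df <- data.frame(id = 1:3, score = c(9.5, NA, 7))
path <- tempfile(fileext = ".csv")

# Write with an explicit separator, header row, and missing-value string
write.table(df, file = path, sep = ",", col.names = TRUE,
            row.names = FALSE, na = "NA")

print(readLines(path))                        # the raw text lines
back <- read.table(path, header = TRUE, sep = ",")
print(back)
unlink(path)                                  # remove the temporary file
```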
1.4 Subsetting R Objects
In the R programming language, subsetting allows the user to access
elements of an object. It takes out a portion of the object based on the
condition provided. There are 4 ways of subsetting in R programming. Which
method to use depends on the user's needs and the type of object. For
example, if there is a data frame with columns such as states, country, and
population, and the user wants to extract states from it, then subsetting
is used to do this operation. This section covers the implementation of
different types of subsetting in R programming.
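The text does not name the four ways; a common reading, assumed here, is the [ ] operator, the [[ ]] operator, the $ operator, and the subset() function. The data frame below is invented for illustration.

```r
places <- data.frame(
  state = c("Kerala", "Texas", "Bavaria"),
  country = c("India", "USA", "Germany"),
  population = c(35, 30, 13),        # illustrative figures, in millions
  stringsAsFactors = FALSE
)

by_bracket <- places["state"]        # [ ]   keeps a one-column data frame
by_double  <- places[["state"]]      # [[ ]] extracts the column as a vector
by_dollar  <- places$state           # $     same extraction, by name
by_subset  <- subset(places, population > 20, select = state)

print(by_dollar)
print(by_subset)
```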
1.5 Installation of R:
R programming is a very popular language, and to work with it we
have to install two things, i.e., R and RStudio. R and RStudio work together to
create a project in R. Installing R on the local computer is very easy. First, we
must know which operating system we are using so that we can download it
accordingly. The official site https://cloud.r-project.org provides binary files for
major operating systems including Windows, Linux, and Mac OS. In some
Linux distributions, R is installed by default, which we can verify from the
console by entering R.
To install R, we can either get it from the site https://cloud.r-project.org
or use commands from the terminal.
1.6 Running R:
1) Go to the official site of R programming
2) Click on the CRAN link on the left sidebar
3) Select a mirror
4) Click “Download R for Windows”
5) Click on the link that downloads the base distribution
6) Run the file and follow the steps in the instructions to install R.
1.7 Packages of R:
R packages are collections of R functions, sample data, and
compiled code. In the R environment, these packages are stored under a
directory called "library." During installation, R installs a set of packages. We
can add packages later when they are needed for some specific purpose. Only
the default packages are available when we start the R console. Other
packages which are already installed must be loaded explicitly before they can
be used by an R program.
A small set of commands is used to check, verify,
and use R packages.
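The commands usually meant here can be sketched as follows. The package name in the commented install.packages() line is only an example, and that line is left commented out because it needs network access.

```r
print(.libPaths())                             # library directories
print(head(rownames(installed.packages())))    # some installed packages
print(search())                                # packages currently attached

# install.packages("ggplot2")   # example only: download a package from CRAN
library(stats)                  # load an installed package explicitly
```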
COMPLEX NUMBER IN R:
Numbers in R can be divided into 3 different categories:
• Numeric: It represents both whole and floating-point numbers. For example,
123, 32.43, etc.
• Integer: It represents only whole numbers and is denoted by the suffix L. For
example, 23L, 39L, etc.
• Complex: It represents complex numbers with imaginary parts. The imaginary
parts are denoted by i. For example, 2 + 3i, 5i, etc.
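The three categories can be checked with class(); the values are taken from the examples above.

```r
x <- 32.43      # numeric
n <- 23L        # integer (note the L suffix)
z <- 2 + 3i     # complex

print(class(x))   # "numeric"
print(class(n))   # "integer"
print(class(z))   # "complex"
print(Re(z))      # 2, the real part
print(Im(z))      # 3, the imaginary part
```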
ROUNDING:
The round() function in R rounds the values in its first argument to the
specified number of decimal places. round() works on the values of a
vector and also on a column of a data frame. Rounding can also be
done with the signif() function, which rounds to a specified number of
significant digits rather than decimal places. The typical uses are:
• round() function to round off the values of a vector.
• round off the values of a vector using the signif() function in R.
• round off a column in an R data frame using the round() function.
• round off the values of a column in a data frame using the signif() function.
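A brief sketch of these uses, with arbitrary numbers and a hypothetical price column:

```r
v <- c(3.14159, 2.71828, 123.456)
print(round(v, 2))     # 3.14 2.72 123.46  (two decimal places)
print(signif(v, 3))    # 3.14 2.72 123     (three significant digits)

df <- data.frame(price = c(10.126, 20.333))
df$rounded <- round(df$price, 1)    # 10.1 20.3
df$signif2 <- signif(df$price, 2)   # 10 20
print(df)
```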
MODULO AND INTEGER QUOTIENTS
Modulo:
The modulus operation is an arithmetic operation in R which calculates the remainder
after division of one numeric variable by another. This recipe demonstrates how to carry
out the modulus operation on two numeric variables and store the result in a third variable.
Step 1:
Creating two numeric variables. We assign numbers to two variables:
a = 10
b = 4
Step 2:
Computing the modulus of the two variables. We use the arithmetic operator " %% " to
carry out this task and finally store the result in a third variable:
# storing the result of the modulus arithmetic operation of the two numbers stored in
# variables 'a' and 'b' in 'result'
result = a %% b
# displaying the value stored in result
result
[1] 2
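The integer quotient named in the heading is computed with the %/% operator. The sketch below reuses a = 10 and b = 4 and adds a negative case to show that the remainder takes the sign of the divisor.

```r
a <- 10
b <- 4
quotient  <- a %/% b     # 2
remainder <- a %% b      # 2
print(quotient * b + remainder == a)   # TRUE

print(-7 %/% 3)   # -3 (rounds toward negative infinity)
print(-7 %% 3)    # 2
```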
VARIABLE NAMES AND ASSIGNMENT
A variable provides us with named storage that our programs can manipulate. A
variable in R can store an atomic vector, group of atomic vectors or a combination of many
R objects. A valid variable name consists of letters, numbers and the dot or underline
characters. The variable name starts with a letter or the dot not followed by a number.
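A short sketch of the assignment forms and the naming rule. The names are invented; make.names() is used here only as a convenient check, since it leaves a syntactically valid name unchanged.

```r
total.score <- 42          # leftward assignment, the most common form
user_name = "asha"         # = also assigns at the top level
c(1, 2, 3) -> .values      # rightward assignment

candidates <- c("var_name2", ".var", "2var", ".2var", "_var")
valid <- make.names(candidates) == candidates
print(setNames(valid, candidates))
# var_name2 and .var are valid; 2var, .2var and _var are not
```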
OPERATORS:
An operator is a symbol that tells the interpreter to perform specific mathematical or
logical manipulations. The R language is rich in built-in operators and provides the
following types of operators.
Types of Operators
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
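One illustrative call per operator type, on an arbitrary vector x:

```r
x <- c(2, 5, 8)            # assignment: <-

print(x + 1)               # arithmetic: 3 6 9
print(x %% 2)              # arithmetic: 0 1 0
print(x > 4)               # relational: FALSE TRUE TRUE
print(x > 4 & x < 8)       # logical:    FALSE TRUE FALSE
print(1:3)                 # miscellaneous: colon builds a sequence
print(5 %in% x)            # miscellaneous: membership test, TRUE
```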
INTEGERS:
In order to create an integer variable in R, we invoke the as.integer function, for
example y <- as.integer(3). We can be assured that y is indeed an integer by applying the
is.integer function.
Incidentally, we can coerce a numeric value into an integer with the as.integer
function.
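A minimal sketch of creating and checking integers:

```r
y <- as.integer(3)
print(is.integer(y))       # TRUE
print(is.integer(3))       # FALSE: a plain 3 is stored as a double
print(as.integer(3.14))    # 3, the fractional part is dropped
print(as.integer("5"))     # 5, strings of digits can be coerced too
print(integer(2))          # 0 0: integer(n) makes a zero-filled vector
```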
FACTORS
Factors are the data objects which are used to categorize the data and store it as
levels. They can store both strings and integers. They are useful in columns which have
a limited number of unique values, like "Male", "Female" and TRUE, FALSE etc. They are
useful in data analysis for statistical modeling.
LOGICAL OPERATIONS:
R supports the element-wise logical operators & (AND), | (OR) and ! (NOT), along
with && and ||, which examine single values. They are applicable only to vectors of type
logical, numeric or complex. All nonzero numbers are considered as logical value TRUE,
and zero as FALSE. Each element of the first vector is compared with
the corresponding element of the second vector. The result of comparison is a Boolean
value.
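A small sketch with arbitrary numeric vectors; note that 0.5 counts as TRUE because it is nonzero.

```r
v <- c(3, 0, 1, 0.5)
t <- c(4, 0, 0, 2)

print(v & t)    # TRUE FALSE FALSE TRUE
print(v | t)    # TRUE FALSE TRUE  TRUE
print(!v)       # FALSE TRUE FALSE FALSE
print(TRUE && FALSE)   # FALSE: && works on single values
```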
UNIT 2
CONTROL STRUCTURES AND VECTORS
2.1 CONTROL STRUCTURES
Control statements are expressions used to control the execution and flow of the program
based on the conditions provided in the statements. In R, there are decision-making
structures like if-else that control execution of the program conditionally. There are
also looping structures that loop or repeat code sections based on certain conditions and
state. These structures are used to make a decision after assessing the variable.
In R programming, there are 8 types of control statements as follows:
if condition
if-else condition
for loop
nested loops
while loop
repeat and break statement
return statement
next statement
1. if
The if-else statement in R enforces conditional execution of code. It is an important part
of R’s decision-making capability. It allows us to make a decision based on the result of a
condition. The if statement contains a condition that evaluates to a logical output.
CODE
# example values, chosen only for illustration
a <- 5
b <- 3
if (a > b) {
  print("a is greater than b")
} else {
  print("b is greater than or equal to a")
}
2. ifelse() Function
The ifelse() function acts like the if-else structure. The following is the syntax of
the ifelse() function in R:
ifelse(condition, exp_if_true, exp_if_false)
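Unlike if-else, ifelse() is vectorised: it tests every element at once. A sketch with invented marks and a pass mark of 40:

```r
marks <- c(35, 72, 50, 18)
result <- ifelse(marks >= 40, "pass", "fail")
print(result)   # "fail" "pass" "pass" "fail"
```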
3. switch
The switch is an easier way to choose between multiple alternatives than multiple if-
else statements. The R switch takes a single input argument and executes a particular
code based on the value of the input. Each possible value of the input is called a case.
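A minimal sketch of switch() with a character input. The function name describe_day and its cases are hypothetical; an empty case falls through to the next one, and a final unnamed value acts as the default.

```r
describe_day <- function(day) {
  switch(day,
    "sat" = ,                  # falls through to the "sun" case
    "sun" = "weekend",
    "mon" = "start of the week",
    "weekday"                  # default when no case matches
  )
}

print(describe_day("sun"))   # "weekend"
print(describe_day("sat"))   # "weekend"
print(describe_day("wed"))   # "weekday"
```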
4. for loops
The for loop in R iterates over a sequence to perform repeated tasks. It works
with a loop variable that takes each value of the sequence in turn. The general form
is for (variable in sequence) followed by the block of code to repeat.
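For instance, a sketch that sums the numbers 1 to 5 and then walks over a character vector:

```r
total <- 0
for (i in 1:5) {
  total <- total + i
}
print(total)   # 15

for (name in c("x", "y")) {
  print(paste("variable", name))
}
```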
5. while Loops
The while loop in R evaluates a condition. If the condition evaluates to TRUE it loops
through a code block, whereas if the condition evaluates to FALSE it exits the loop. The
while loop in R keeps looping through the enclosed code block as long as the condition
is TRUE. This can also result in an infinite loop sometimes which is something to avoid.
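A minimal while loop; the update of count inside the body is what prevents an infinite loop.

```r
count <- 1
while (count <= 3) {
  print(count)
  count <- count + 1   # without this update the condition would stay TRUE
}
```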
6. break Statement
The break statement can break out of a loop. Imagine a loop searching a specific
element in a sequence. The loop needs to keep going until either it finds the element or
until the end of the sequence. If it finds the element early, further looping is not needed.
In such a case, the R break statement can “break” us out of the loop early.
8. repeat loop
The repeat loop in R initiates an infinite loop from the get-go. The only way to get out
of the loop is to use the break statement. The repeat loop is useful when you don’t know
the required number of iterations.
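The search scenario described for break can be sketched with a repeat loop. The vector and target are made up; the loop stops either at the first match or at the end of the sequence.

```r
values <- c(4, 9, 7, 2)
target <- 7
i <- 1
found_at <- NA

repeat {
  if (i > length(values)) break    # reached the end without a match
  if (values[i] == target) {
    found_at <- i
    break                          # found it, no further looping needed
  }
  i <- i + 1
}
print(found_at)   # 3
```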
2.2 Function
An R function is created by using the keyword function. The basic syntax of an R function
definition is as follows −
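As an illustrative sketch, a definition has the shape name <- function(arguments) { body }. The function rectangle_area below is hypothetical.

```r
rectangle_area <- function(width, height = 1) {
  area <- width * height
  area            # the last evaluated expression is the return value
}

print(rectangle_area(3, 4))   # 12
print(rectangle_area(5))      # 5, using the default height
```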
R has many built-in functions which can be called directly in a program without
defining them first. We can also create and use our own functions, referred to as
user-defined functions.
The scoping rules of a language determine how a value is associated with a free
variable in a function. R uses lexical scoping, also called static scoping. An alternative
to lexical scoping is dynamic scoping, which is implemented by some languages. Lexical
scoping turns out to be particularly useful for simplifying statistical computations.
Free variables are not formal arguments and are not local variables (assigned inside the
function body).
What is an environment?
An environment is a collection of (symbol, value) pairs, i.e. x is a symbol and 3.14 might
be its value. Every environment has a parent environment and it is possible for an
environment to have multiple “children”. The only environment without a parent is
the empty environment.
Typically, a function is defined in the global environment, so that the values of free
variables are just found in the user’s workspace. This behavior is logical for most people
and is usually the “right thing” to do. However, in R you can have functions defined inside
other functions (languages like C don’t let you do this). Now things get interesting—in this
case the environment in which a function is defined is the body of another function!
Here is an example of a function that returns another function as its return value.
Remember, in R functions are treated like any other object and so this is perfectly valid.
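A sketch of such a function, with the hypothetical name make_power: the inner function pow uses the free variable n, which R finds in the environment where pow was defined.

```r
make_power <- function(n) {
  pow <- function(x) {
    x^n          # n is a free variable inside pow
  }
  pow            # return the function itself
}

square <- make_power(2)
cube <- make_power(3)
print(square(4))   # 16
print(cube(2))     # 8
print(get("n", environment(cube)))   # 3, stored in cube's defining environment
```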
With lexical scoping the value of y in the function g is looked up in the environment in
which the function was defined, in this case the global environment, so the value of y is 10.
With dynamic scoping, the value of y is looked up in the environment from which the
function was called (sometimes referred to as the calling environment). In R the calling
environment is known as the parent frame. In this case, the value of y would be 2.
When a function is defined in the global environment and is subsequently called from the
global environment, then the defining environment and the calling environment are the
same. This can sometimes give the appearance of dynamic scoping.
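The discussion of y, f and g can be reproduced with the sketch below. The exact definitions are an assumption, chosen to match the values 10 and 2 mentioned in the text.

```r
y <- 10

f <- function(x) {
  y <- 2          # local to f
  y^2 + g(x)
}

g <- function(x) {
  x * y           # y is free here; lexical scoping finds y = 10
}

print(f(3))   # 34, that is 2^2 + 3 * 10
```

Under dynamic scoping g would instead see y = 2 from its caller, giving 4 + 6 = 10.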
2.4 Dates And Times
Dates
R has a special representation for dates and times. Dates are represented by
the Date class and times are represented by the POSIXct or the POSIXlt class. Dates are
stored internally as the number of days since 1970-01-01 while times are stored
internally as the number of seconds since 1970-01-01.
Times
Times are represented by the POSIXct or the POSIXlt class. POSIXct is just a very large
integer under the hood. It is a useful class when you want to store times in something
like a data frame. POSIXlt is a list underneath and it stores a bunch of other useful
information like the day of the week, day of the year, month, day of the month. This is
useful when you need that kind of information.
There are a number of generic functions that work on dates and times to help you extract
pieces of dates and/or times.
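A short sketch of the internal representations; the dates are arbitrary, and the time zone is fixed to UTC so that the numbers do not depend on the local settings.

```r
d <- as.Date("1970-01-10")
print(as.numeric(d))       # 9 days since 1970-01-01

ct <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
print(as.numeric(ct))      # 60 seconds since 1970-01-01

lt <- as.POSIXlt("2024-03-15 08:30:00", tz = "UTC")
print(lt$mday)             # 15, day of the month
print(lt$mon + 1)          # 3, months are stored 0-based
print(lt$wday)             # 5, day of the week (0 = Sunday)
print(lt$yday)             # 74, day of the year (0-based)
```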
2.5 Introduction to Functions
2.5.1Preview Of Some Important R Data Structures
A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values. R’s base data structures
are often organized by their dimensionality (1D, 2D, or nD) and whether they’re
homogeneous (all elements must be of the identical type) or heterogeneous (the elements
are often of various types). This gives rise to the six data types which are most frequently
utilized in data analysis.
The most essential data structures used in R include:
Vectors
Lists
Dataframes
Matrices
Arrays
Factors
One of the key features of R is that it can handle complex statistical operations in an
easy and optimized way. R handles complex computations using:
Vectors – A basic data structure of R containing elements of the same type.
Matrices – A matrix is a rectangular array of numbers or other mathematical objects.
We can do operations such as addition and multiplication on matrices in R.
Lists – Lists store collections of objects which, unlike the elements of a vector or
matrix, may differ in type and length.
Data Frames – Generated by combining multiple vectors such that each vector
becomes a separate column.
2.6 Vectors in R
In R, a vector is a basic data structure that contains elements of the same type. The
data type of a vector can be logical, integer, double, character, complex or raw.
The function typeof() can be used to check the data type of a vector.
One more significant property of an R vector is its length. The function length() determines
the number of elements in the vector.
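For example, with throwaway vectors:

```r
print(typeof(c(TRUE, FALSE)))    # "logical"
print(typeof(c(1L, 2L)))         # "integer"
print(typeof(c(1.5, 2)))         # "double"
print(typeof(c("a", "b")))       # "character"
print(typeof(c(1, "a", TRUE)))   # "character": mixed input is coerced
print(length(c(10, 20, 30)))     # 3
```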
Adding and Deleting Vector Elements
Vectors are stored like arrays in C, contiguously, and thus you cannot insert or delete
elements, something you may be used to if you are a Python programmer. The size of a
vector is determined at its creation, so if you wish to add or delete elements, you need
to reassign the vector.
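In practice, adding or deleting an element means building a new vector and assigning it back to the same name, as in this sketch with arbitrary values:

```r
x <- c(88, 5, 12, 13)
x <- c(x[1:3], 168, x[4])   # "insert" 168 before the last element
print(x)                    # 88 5 12 168 13

x <- x[-2]                  # "delete" the second element
print(x)                    # 88 12 168 13
```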
Obtaining the Length of a Vector
If one needs to convert a numerical value to a string, one can use the paste()
function, as shown below:
> paste(5)
[1] "5"
This function can also be used to concatenate corresponding elements of vectors
containing strings, as shown here:
> paste(c("A","B","C"), c("1","2","3"))
[1] "A 1" "B 2" "C 3"
Notice in the example above, a single space was inserted between each letter and
number. If one prefers to concatenate the elements with no space between them, one can
use the paste0() function instead, as the following suggests.
> paste0(c("A","B","C"), c("1","2","3"))
[1] "A1" "B2" "C3"
Matrices
A matrix can represent the binding of two or more vectors of equal length. Analogous
operations can be used to change the size of a matrix. For instance, the rbind() (row bind)
and cbind() (column bind) functions let you add rows or columns to a matrix.
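A small sketch that grows a matrix with both functions (values chosen arbitrarily):

```r
m <- cbind(c(1, 2), c(3, 4))   # bind two vectors as columns: 2 x 2
m <- rbind(m, c(5, 6))         # add a row: now 3 x 2
m <- cbind(m, c(7, 8, 9))      # add a column: now 3 x 3
print(m)
print(dim(m))                  # 3 3
```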
Applying Functions to Matrix Rows and Columns
One of the most famous and most used features of R is the *apply() family of functions,
such as apply(), tapply(), and lapply(). Here, we’ll look at apply(), which instructs R to call
a user-specified function on each of the rows or each of the columns of a matrix.
Using the apply() Function
This is the general form of apply for matrices:
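In base R the call is apply(X, MARGIN, FUN, ...), where MARGIN is 1 for rows and 2 for columns and any further arguments are passed on to FUN. A minimal sketch on a small matrix:

```r
z <- matrix(1:6, nrow = 3)     # columns are 1:3 and 4:6

print(apply(z, 1, sum))        # row sums:     5 7 9
print(apply(z, 2, mean))       # column means: 2 5
print(apply(z, 2, max))        # column maxima: 3 6
```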
2.9 Lists in R
Lists are R data types that store collections of objects of differing lengths and types;
they are created using the list() function. In contrast to a vector, in which all elements
must be of the same mode, R’s list structure can combine objects of different types. The
list plays a central role in R, forming the basis for data frames, object-oriented
programming, and so on.
Creating Lists
Technically, a list is a vector. Ordinary vectors—those of the type we’ve been using
so far in this book—are termed atomic vectors, since their components cannot be broken
down into smaller components. In contrast, lists are referred to as recursive vectors.
Let’s consider an employee database. For each employee, we wish to store the name,
salary, and a Boolean indicating union membership. Since we have three different modes
here—character, numeric, and logical—it’s a perfect place for using lists. Our entire
database might then be a list of lists, or some other kind of list such as a data frame, though
we won’t pursue that here.
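One employee record could be sketched as follows; the name and salary are invented values.

```r
# name is character, salary is numeric, union is logical
j <- list(name = "Joe", salary = 55000, union = TRUE)

print(j$salary)       # 55000
print(j[["name"]])    # "Joe"
str(j)                # shows the three components and their modes
```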
2.10 Data Frames
The sequence and number of observations in the vectors must be the same for each
vector in the Data Frame to represent a DataSet.
The first, second and third entries in each vector, for example, must represent the
observations collected from first, second and third sampling units respectively.
Programming in R
Many built-in functions, libraries and add-on tools are available for R, and their
number continues to grow at an incredible rate. Even so, a program sometimes needs to
perform a task for which no function exists. Since R is itself a programming language, its
functionality can be extended to accommodate more procedures; how easy this is depends
on the complexity of the procedure and the level of R proficiency of the user.
2.11 Classes Vectors: Generating sequences
sequence() function in R Language is used to create a vector of sequenced elements. It
creates vectors with specified length, and specified differences between elements. It is
similar to seq() function.
Syntax: sequence(x)
Parameters:
x: Maximum element of vector
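A brief comparison of sequence() and seq(), with arbitrary arguments. When x has several elements, sequence() concatenates one run 1..n for each of them.

```r
print(sequence(5))                  # 1 2 3 4 5
print(sequence(c(3, 2)))            # 1 2 3 1 2
print(seq(2, 10, by = 4))           # 2 6 10
print(seq(0, 1, length.out = 5))    # 0.00 0.25 0.50 0.75 1.00
```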
Here, the logical expression is my_vec > 4 and R will only extract those elements
that satisfy this logical condition. So how does this actually work? If we look at the
output of just the logical expression without the square brackets you can see that R
returns a vector containing either TRUE or FALSE which correspond to whether the
logical condition is satisfied for each element. In this case only the 4th and 8th elements
return a TRUE as their value is greater than 4.
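The values of my_vec are not given in the text; the vector below is an assumption, chosen so that only the 4th and 8th elements exceed 4.

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)   # assumed values

print(my_vec > 4)            # TRUE only at positions 4 and 8
print(my_vec[my_vec > 4])    # 6 7
print(which(my_vec > 4))     # 4 8
```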
User Created Functions:
Expressions – Commands entered at the R command prompt.
Assignment – Assigns a name to an object.
Arithmetic Operations – When numeric values are present, we use arithmetic
operators to perform operations on them.
Vectors and subscript
The vector type is really the heart of R. It’s hard to imagine R code, or even an
interactive R session, that doesn’t involve vectors. The elements of a vector must all
have the same mode, or data type. We can have a vector consisting of three character
strings (of mode character) or three integer elements (of mode integer), but not a vector
with one integer element and two character string elements. Vectors in R are the same as
the arrays in C language which are used to hold multiple data values of the same type.
One major key point is that in R the indexing of the vector will start from ‘1’ and not
from ‘0’. We can create numeric vectors and character vectors as well.
Recycling - The automatic lengthening of vectors in certain settings
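A minimal illustration of recycling: the shorter vector is repeated until it matches the length of the longer one.

```r
print(c(1, 2, 3, 4) + c(10, 20))   # 11 22 13 24 (c(10, 20) is used twice)
print(c(1, 2, 3) * 2)              # 2 4 6 (the single 2 is recycled)
```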
The most commonly used operation for array multiplication is the matrix product, written
x %*% y in R, which takes two array inputs x and y and returns their "dot product". For
matrices, it forms each entry of the result by summing the products of a row of x with a
column of y, so the number of columns of x must equal the number of rows of y. This may
sound like a complicated rule, but you should be able to convince yourself that it
corresponds to the appropriate type of multiplication operation for the most common
cases encountered in linear algebra, such as matrix-matrix and matrix-vector products.
2.21 VECTOR INDEXING:
Vector elements are accessed using indexing vectors, which can be numeric, character or
logical vectors. You can access an individual element of a vector by its position (or
"index"), indicated using square brackets. In R, the first element has an index of 1. You
can access multiple elements of a vector by specifying a vector of element indices inside
the square brackets. All the methods that we learned about in the last section can be used
to generate these indexing vectors.
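The three kinds of indexing vector can be sketched on a small named vector (names and values invented):

```r
ages <- c(ann = 31, bob = 25, cy = 40)

print(ages[2])                       # by position: bob 25
print(ages[c(1, 3)])                 # several positions: ann 31, cy 40
print(ages["cy"])                    # by name (character index)
print(ages[c(TRUE, FALSE, TRUE)])    # by logical vector: ann 31, cy 40
```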
2.22 Common Operations on Vectors
Vectors are the most basic data types in R. Even a single object created is also stored
in the form of a vector. Vectors are nothing but arrays as defined in other languages.
Vectors contain a sequence of homogeneous types of data. If mixed values are given, then
R automatically coerces them to a common type according to precedence. There are
various operations that can be performed on vectors in R.
1. Combining Vector in R
Functions are used to combine vectors. In order to combine the two vectors in R, we will
create two new vectors ‘n’ and ‘s’. Then, we will create another vector that will combine
these two using c(n,s) as follows:
For example:
> #Author DataFlair
> n = c(1, 2, 3, 4)
> s = c("Hadoop", "Spark", "HIVE", "Flink")
> c(n,s)
2. Arithmetic Operations on Vectors in R
Arithmetic operations on vectors can be performed member-by-member.
For example:
Suppose we have two vectors a and b:
> #Author DataFlair
> a = c (1, 3)
> b = c (1, 3)
> a + b #Addition
For subtraction:
> a - b #Subtraction
For division:
> a / b #Division
For remainder operation:
> a %% b #Remainder Operation
3. Logical Index Vector in R
By using a logical index vector in R, we can form a new vector from a given vector.
The logical vector has the same length as the original vector. Its members are TRUE
where the corresponding members of the original vector are to be included in the slice,
and FALSE otherwise.
For example:
> #Author DataFlair
> S = c("bb", "cc")
> L = c(TRUE, TRUE) #Defining our Logical Vector
> S[L] #This will return elements of vector S that correspond to TRUE in logical vector L
4. Numeric Index
To index a vector by position in R, we specify the index between square brackets [ ]. If
our index is negative, then R returns all the values except the one at the position that we
have specified. For example, specifying [-2] prompts R to take the absolute value of -2,
find the element that occupies that position, and return the vector without it.
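For example, using the same style of vector as the other examples in this section:

```r
s <- c("aa", "bb", "cc", "dd", "ee")

print(s[3])           # "cc"
print(s[-2])          # "aa" "cc" "dd" "ee"  (everything except position 2)
print(s[-c(1, 5)])    # "bb" "cc" "dd"
print(s[10])          # NA: an out-of-range index gives a missing value
```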
5. Duplicate Index
The index vector allows duplicate values. Hence, the following retrieves a member twice
in one operation.
For example:
> # Author DataFlair
> s = c("aa", "bb", "cc", "dd", "ee")
> s[c(2,3,3)]
6. Range Indexes
To produce a vector slice between two indexes, we can use the colon operator “:“. It is
convenient for situations involving large vectors.
For example:
> # Author DataFlair
> s = c("aa", "bb", "cc", "dd", "ee")
> s[1:3]
UNIT III
LISTS
List:
An R list is an object which contains elements of different types, like strings,
numbers, vectors and another list inside it. An R list can also contain a matrix or a function
as its elements. A list is a vector but with heterogeneous data elements. A list in R is
created with the use of the list() function. R allows accessing elements of a list with the use
of the index value. In R, the indexing of a list starts with 1 instead of 0, unlike many other
programming languages.
Creating a List:
How to Create Lists in R Programming?
The process of creating a list is similar to creating a vector. In R, a vector is created with
the help of the c() function. Like c(), there is another function, list(), which is
used to create a list in R. A list avoids the main drawback of the vector, which is that all
its elements must share one data type: we can add elements of different data types to a list.
Syntax
list()
Outputs
Selected Documents: Documents containing the queried word.
Concordances: A table of concordances.
Concordance finds the queried word in a text and displays the context in which
this word is used. Results in a single color come from the same document. The widget
can output selected documents for further analysis or a table of concordances for the
queried word.
Note that the widget finds only exact matches of a word, which means that if you
query the word ‘do’, the word ‘doctor’ won’t appear in the results.
1. Information:
Documents: number of documents on the input.
Tokens: number of tokens on the input.
Examples
Now we can select those documents that contain interesting contexts and
output them to Corpus Viewer to inspect them further.
https://orange3-text.readthedocs.io/en/latest/_images/Concordance-Example1.png
In the second example, we will output concordances instead. We will keep
the book-excerpts.tab in Corpus and the connection to Concordance. Our queried
word remains ‘doctor’.
This time, we will connect Data Table to Concordance and select Concordances
output instead. In the Data Table, we get a list of concordances for the queried
word and the corresponding documents. Now, we will save this table with Save
Data widget, so we can use it in other projects or for further analysis.
Data Frames
A data frame is a table or a two-dimensional array-like structure in which each
column contains values of one variable and each row contains one set of values
from each column.
Access Item
Note that in that second call, since examsquiz[2:5,2] is a vector, R created a vector
instead of another data frame. We can also do filtering. Here’s how to extract the
subframe of all students whose first exam score was at least 3.8:
3 4 4.0 4.0
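The examsquiz data are not shown in the text, so the sketch below uses a small invented data frame with the same column layout to illustrate both points: extracting a single column yields a vector, and a logical condition on a column filters rows.

```r
examsquiz <- data.frame(              # hypothetical scores
  Exam.1 = c(2.0, 3.3, 4.0, 3.9),
  Exam.2 = c(3.3, 2.0, 4.0, 3.7),
  Quiz   = c(4.0, 3.7, 4.0, 3.0)
)

print(examsquiz[2:3, 2])                      # a plain vector: 2 4
print(examsquiz[2:3, 2, drop = FALSE])        # stays a data frame
high <- examsquiz[examsquiz$Exam.1 >= 3.8, ]  # rows 3 and 4
print(high)
```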
As the name indicates, missing values are those elements whose values are not known.
NA and NaN are reserved words in the R programming language: NA indicates a
missing value, and NaN (not a number) marks the result of arithmetical operations
that are undefined.
R – handling Missing Values
Missing values are common in practice. For example, some cells in spreadsheets are
empty. If an undefined or impossible arithmetic operation is attempted, NaN results,
which R also treats as missing.
Suppose the second exam score for the first student had been missing. Then we
would have typed the following into that line when we were preparing the data file:
2.0 NA 4.0
In any subsequent statistical analyses, R would do its best to cope with the
missing data. However, in some situations, we need to set the option na.rm=TRUE
(Remove the NA values), explicitly telling R to ignore NA values. For instance,
with the missing exam score, calculating the mean score on exam 2 by calling R’s
mean() function would skip that first student in finding the mean. Otherwise, R
would just report NA for the mean.
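A minimal sketch with made-up exam scores, where the first student's score is missing:

```r
exam2 <- c(NA, 3.3, 4.0, 2.3)

print(mean(exam2))                  # NA
print(mean(exam2, na.rm = TRUE))    # 3.2, the first student is skipped
print(is.na(exam2))                 # TRUE FALSE FALSE FALSE
print(0 / 0)                        # NaN, an undefined arithmetic result
```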
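Syntax: function(vector, na.rm = TRUE)
Where,
vector is the input vector and na.rm = TRUE tells the function to drop NA values
before computing.
na.rm in dataframe
To use na.rm with a data frame, we pass it through an apply-family function,
which forwards it to the function applied to each column or row.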
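For a data frame this could look as follows; the column names and values are invented.

```r
scores <- data.frame(exam1 = c(2, NA, 4), exam2 = c(NA, 3, 5))

col_means <- sapply(scores, mean, na.rm = TRUE)    # per column
row_sums  <- apply(scores, 1, sum, na.rm = TRUE)   # per row

print(col_means)   # exam1 3, exam2 4
print(row_sums)    # 2 3 9
```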
rbind(): The rbind, or row bind, function is used to bind or combine multiple
groups of rows together.
rbind(x,x1)
Where:
x = the input data.
x1 = the data that needs to be bound to it.
The idea of binding rows using rbind()
Binding, or combining, the rows of multiple data frames is highly beneficial in
data manipulation: rbind() stacks the rows of one data frame beneath the rows
of another, provided the columns match.
Use the rbind() function to add new rows in a Data Frame:
Data_Frame<-data.frame(
Training=c("Strength","Stamina","Other"),
Pulse=c(100,150,120),
Duration=c(60,30,45)
)
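#Add a new row (a one-row data frame keeps Pulse and Duration numeric)
New_row_DF<-rbind(Data_Frame,data.frame(Training="Strength",Pulse=110,Duration=110))
#Print the data frame with the new row
New_row_DF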
cbind():
The cbind() (column bind) function is used for merging two data frames
together, given that the number of rows in both data frames is equal. cbind() can
append vectors, matrices or any data frame by columns. This recipe demonstrates the
same operation using bind_cols() from the dplyr package.
Note: The number of rows in the two data frames needs to be the same for both
the cbind() function and the bind_cols() function.
library(dplyr)
# df1 and df2 are example data frames (assumed) with the same number of rows
df1 = data.frame(id = 1:3)
df2 = data.frame(score = c(90, 85, 70))
colbinded_df = bind_cols(df1,df2)
colbinded_df
The result is a single data frame containing the columns of both inputs.
The cbind() and bind_cols() functions perform in a similar manner and can
be used interchangeably for column binding. For further details on bind_cols(),
refer to the dplyr package documentation.
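The base R equivalent needs no extra package. A sketch with two invented data frames of three rows each:

```r
df1 <- data.frame(id = 1:3, name = c("a", "b", "c"))
df2 <- data.frame(score = c(90, 85, 70))

both <- cbind(df1, df2)                            # bind the columns side by side
both <- cbind(both, passed = both$score >= 80)     # a vector can be appended too
print(both)
print(dim(both))    # 3 4
```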
UNIT 4
FACTORS AND TABLES
Factors are used to represent categorical data and can be unordered or
ordered. One can think of a factor as an integer vector where each integer has a
label. Factors are important in statistical modeling and are treated specially by
modelling functions like lm() and glm(). Using factors with labels is better than
using integers because factors are self-describing. Having a variable that has values
“Male” and “Female” is better than a variable that has values 1 and 2. Factor objects
can be created with the factor() function.
In versions of R before 4.0.0, factors were often created automatically when you
read a dataset in using a function like read.table(), because those functions
defaulted to converting character columns to factors. Since R 4.0.0 the default is
stringsAsFactors = FALSE, so the conversion must be requested explicitly. The order
of the levels of a factor can be set using the levels argument to factor(). This can
be important in linear modelling because the first level is used as the baseline level.
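The effect of the levels argument can be sketched as follows. The vector x is
illustrative data, not taken from the handout.

```r
# Illustrative data (an assumption, not from the handout)
x <- c("yes", "no", "yes", "yes", "no")

f_default <- factor(x)                            # levels sorted alphabetically
f_ordered <- factor(x, levels = c("yes", "no"))   # "yes" becomes the baseline level

levels(f_default)       # "no"  "yes"
levels(f_ordered)       # "yes" "no"
as.integer(f_ordered)   # underlying integer codes: 1 2 1 1 2
table(f_ordered)        # counts per level
```

The integer codes show why a factor can be thought of as an integer vector with labels.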
With factors, we have yet another member of the family of apply functions,
tapply. We’ll look at that function, as well as two other functions commonly used
with factors: split() and by().
The tapply() Function
tapply() is used to apply a function over subsets of a vector. It is primarily used
when we have the following circumstances:
1. A dataset that can be broken up into groups (via categorical variables - aka
factors)
2. We desire to break the dataset up into groups
3. Within each group, we want to apply a function
The arguments to tapply() are as follows:
x is a vector
INDEX is a factor or a list of factors (or else they are coerced to factors)
FUN is a function to be applied
... contains other arguments to be passed to FUN
# syntax of tapply function
tapply(x, INDEX, FUN, ..., simplify = TRUE)
To provide an example we’ll use the built in mtcars dataset and calculate the
mean of the mpg variable grouped by the cyl variable.
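A minimal sketch of the call described above, using the built-in mtcars dataset:

```r
# Mean miles per gallon within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

# The result is a named vector; names are the levels of the grouping factor
names(mpg_by_cyl)   # "4" "6" "8"
```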
As a second illustration, suppose u is a numeric vector of six values and we call
tapply(u, c("R","D","D","R","U","D"), mean). The function tapply() treats the vector
("R","D","D","R","U","D") as a factor with levels "D", "R", and "U". It notes that
"D" occurs in indices 2, 3 and 6; "R" occurs in indices 1 and 4; and "U"
occurs in index 5. For convenience, let’s refer to the three index vectors (2,3,6),
(1,4), and (5) as x, y, and z, respectively. Then tapply() computes mean(u[x]),
mean(u[y]), and mean(u[z]) and returns those means in a three-element vector.
That vector’s element names are "D", "R", and "U", reflecting the factor levels
that were used by tapply().
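The grouping just described can be reproduced as a small sketch. The values in u
are assumed for illustration; only the grouping vector comes from the text.

```r
# u holds assumed values; affils is the grouping vector from the text
u <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

group_means <- tapply(u, affils, mean)
group_means

# Same result computed by hand from the index vectors (2,3,6), (1,4), (5)
c(D = mean(u[c(2, 3, 6)]), R = mean(u[c(1, 4)]), U = mean(u[5]))
```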
The split() Function
In contrast to tapply(), which splits a vector into groups and then applies a
specified function on each group, split() stops at that first stage, just forming the
groups.
The basic form, without bells and whistles, is split(x,f), with x and f playing
roles similar to those in the call tapply(x,f,g); that is, x being a vector or data frame
and f being a factor or a list of factors. The action is to split x into
groups, which are returned in a list. (Note that x is allowed to be a data frame with
split() but not with tapply().)
The output of split() is a list, and recall that list components are denoted by
dollar signs. So the last vector, for example, was named "M.1" to indicate that it was
the result of combining "M" in the first factor and 1 in the second.
The vector g, taken as a factor, has three levels: "M", "F", and "I". The indices
corresponding to the first level are 1, 5, and 6, which means that g[1], g[5], and g[6]
all have the value "M". So, R sets the M component of the output to elements 1, 5,
and 6 of 1:7, which is the vector (1,5,6).
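A sketch of this call follows. The text fixes only the positions of "M" (1, 5, and
6); the positions of "F" and "I" below are assumed.

```r
# Positions of "M" follow the text; the "F" and "I" positions are assumptions
g <- c("M", "F", "F", "I", "M", "M", "F")

groups <- split(1:7, g)
groups      # a list with components $F, $I and $M
groups$M    # 1 5 6
```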
We can take a similar approach to simplify the code in our text concordance
example from Section 4.2.4. There, we wished to input a text file, determine which
words were in the text, and then output a list giving the words and their locations
within the text. We can use split() to make short work of writing the code, as
follows:
The call to scan() returns a vector txt of the words read in from the file tf. So,
txt[[1]] will contain the first word input from the file, txt[[2]] will contain the
second word, and so on; length(txt) will thus be the total number of words read.
Meanwhile, txt itself, as the second argument in split() above, will be taken as
a factor. The levels of that factor will be the various words in the file. If, for
instance, the file contains the word world 6 times and climate was there 10 times,
then “world” and “climate” will be two of the levels of txt. The call to split() will
then determine where these and the other words appear in txt.
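The concordance idea described above might be sketched like this. The function
name findwords and the sample sentence are assumptions made for illustration.

```r
# Sketch: map each word in a text file to the positions where it occurs
findwords <- function(tf) {
  txt <- scan(tf, what = "", quiet = TRUE)   # character vector of words
  split(seq_along(txt), txt)                 # txt is taken as a factor
}

# Usage with a small temporary file (contents are illustrative)
tf <- tempfile()
writeLines("the world and the climate of the world", tf)
words <- findwords(tf)
words$world   # 2 8
words$the     # 1 4 7
```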
The by() Function
The by() function in R applies a function to specified subsets of a data frame. The
first parameter of by() is the data, the second is the factor by which the data is
grouped, and the third is the function to be applied to each group.
Syntax of by() function in R:
by(data, data$byvar, FUN)
data: an R object, normally a data frame, possibly a matrix.
aba$Gender: I
Call:
lm(formula = m[, 2] ~ m[, 3])
Coefficients:
(Intercept) m[, 3]
0.02997 1.21833
------------------------------------------------------------
aba$Gender: M
Call:
lm(formula = m[, 2] ~ m[, 3])
Coefficients:
(Intercept) m[, 3]
0.03653 1.19480
Calls to by() look very similar to calls to tapply(), with the first argument
specifying our data, the second the grouping factor, and the third the function to be
applied to each group. Just as tapply() forms groups of indices of a vector according
to levels of a factor, this by() call finds groups of row numbers of the data frame
aba.
That creates three sub-data frames: one for each gender level of M, F, and I.
The anonymous function we defined regresses the second column of its matrix
argument m against the third column.
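Because the aba data frame is not available here, the following sketch uses the
built-in iris dataset as a stand-in. The pattern is the same: group the rows by a
factor and regress one column on another within each group.

```r
# Stand-in example: iris replaces the aba data frame (an assumption)
fits <- by(iris, iris$Species, function(m) lm(m[, 1] ~ m[, 2]))

fits                    # one fitted model per species
class(fits)             # "by"
coef(fits[["setosa"]])  # intercept and slope for one group
```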
Tables are often essential for organizing and summarizing your data,
especially with categorical variables. When you create a table in R, it is stored
as a specific type of object (of class "table"), which is very similar to a data
frame.
To begin exploring R tables, consider this example:
> u <- c(22,8,33,6,8,29,-2)
> fl <- list(c(5,12,13,12,13,5,13),c("a","bc","a","a","bc","a","a"))
> tapply(u,fl,length)
   a bc
5  2 NA
12 1  1
13 2  1
Here, tapply() again temporarily breaks u into subvectors, and then applies the
length() function to each subvector. (Note that this is independent of what’s in u.
Our focus now is purely on the factors.) Those subvector lengths are the counts of
the occurrences of each of the 3 × 2=6 combinations of the two factors. For instance,
5 occurred twice with "a" and not at all with "bc"; hence the entries 2 and NA in the
first row of the output. In statistics, this is called a contingency table.
The first argument in a call to table() is either a factor or a list of factors. The
two factors here were (5,12,13,12,13,5,13) and ("a","bc","a","a","bc", "a","a"). In
this case, an object that is interpretable as a factor is counted as one.
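The same counts can be obtained directly with table(), as in this sketch that
reuses u and fl from the example above. One difference is that table() reports 0
for a combination that never occurs, where tapply() reports NA.

```r
u <- c(22, 8, 33, 6, 8, 29, -2)
fl <- list(c(5, 12, 13, 12, 13, 5, 13), c("a", "bc", "a", "a", "bc", "a", "a"))

by_tapply <- tapply(u, fl, length)   # NA for the empty 5/"bc" cell
by_table  <- table(fl)               # 0 for the empty 5/"bc" cell

by_tapply
by_table
```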
Typically a data frame serves as the table() data argument. Suppose for
instance the file ct.dat consists of election-polling data, in which candidate X is
running for reelection. The ct.dat file looks like this:
"Vote for X" "Voted For X Last Time"
"Yes" "Yes"
"Yes" "No"
"No" "No"
"Not Sure" "Yes"
"No" "No"
In the usual statistical fashion, each row in this file represents one subject
under study. In this case, we have asked five people two questions: whether they
plan to vote for candidate X, and whether they voted for X last time.
This gives us five rows in the data file. Let’s read in the file:
> ct <- read.table("ct.dat",header=T)
> ct
  Vote.for.X Voted.for.X.Last.Time
1        Yes                   Yes
2        Yes                    No
3         No                    No
4   Not Sure                   Yes
5         No                    No
We can use the table() function to compute the contingency table for this data:
> cttab <- table(ct)
> cttab
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1
The 2 in the upper-left corner of the table shows that we had, for example, two
people who said “no” to the first and second questions. The 1 in the middle-right
indicates that one person answered “not sure” to the first question and “yes” to
the second question. We can also get one-dimensional counts, which are counts on
a single factor, as follows:
> table(c(5,12,13,12,8,5))
 5  8 12 13
 2  1  2  1
Here’s an example of a three-dimensional table, involving voters’ genders,
race (white, black, Asian, and other), and political views (liberal or conservative)
Just as most matrix/array operations can be used on data frames, they can be
applied to tables, too.
For example, we can access the table cell counts using matrix notation. Let’s
apply this to our voting example from the previous section.
In the second command, even though the first command had shown that cttab
had class “table”, we treated it as a matrix and printed out its “[1,1] element.”
Continuing this idea, the third command printed the first column of this “matrix.”
We can multiply the matrix by a scalar. For instance, here’s how to change cell
counts to proportions:
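The commands discussed above can be sketched as follows. The data frame ct is
rebuilt in memory from the five rows of ct.dat shown earlier, so the file itself is
not needed.

```r
# Rebuild the polling data in memory (same five rows as ct.dat)
ct <- data.frame(
  Vote.for.X = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)

class(cttab)         # "table"
cttab[1, 1]          # the No/No cell
cttab[, 1]           # first column of the "matrix"
props <- cttab / 5   # cell counts converted to proportions
props
```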
In statistics, the marginal values of a variable are those obtained when this
variable is held constant while others are summed. In the voting example, the
marginal values of the Vote.for.X variable are 2 + 0 = 2, 0 + 1 = 1, and 1 + 1 = 2.
We can of course obtain these via the matrix apply() function:
Note that the labels here, such as No, came from the row names of the matrix,
which table() produced. R also supplies the function addmargins() for this purpose,
that is, to find marginal totals. Here’s an example:
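A sketch of both ways of obtaining the marginal values, again rebuilding the voting
table in memory from the five rows shown earlier:

```r
ct <- data.frame(
  Vote.for.X = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)

# Marginal values of Vote.for.X: sum across each row
row_margins <- apply(cttab, 1, sum)
row_margins              # No 2, Not Sure 1, Yes 2

# addmargins() appends a "Sum" row and a "Sum" column
with_margins <- addmargins(cttab)
with_margins
```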
V1 V2 V3
A B 1
A C 1
A D 0
A E 1
A F 0
A G 0
A H 0
Here, we extract a subtable data2 that keeps every line where V3 == 1:
V1 V2 V3
A B 1
A C 1
A E 1
R checks the V3 value of every row and copies the rows where it equals 1 into a
separate subtable.
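One way to perform this extraction is logical row indexing, sketched below. The
name data for the full table is an assumption; data2 follows the text.

```r
# The seven rows shown above ("data" is an assumed name)
data <- data.frame(
  V1 = rep("A", 7),
  V2 = c("B", "C", "D", "E", "F", "G", "H"),
  V3 = c(1, 1, 0, 1, 0, 0, 0)
)

# Keep only the rows where V3 equals 1
data2 <- data[data$V3 == 1, ]
data2

# Equivalent form using subset()
data2_alt <- subset(data, V3 == 1)
```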
It can be difficult to view a table that is very big, with a large number of rows
or dimensions. One approach might be to focus on the cells with the largest
frequencies. That’s the purpose of the tabdom() function developed below: it
reports the dominant frequencies in a table. Here’s a simple call:
tabdom(tbl,k)
The function tells us that the values 5 and 12 were the most frequent in d,
with four instances each, and the next most frequent value was 4, with two
instances.
As another example, consider our table cttab in the examples in the preceding
sections:
> tabdom(cttab,2)
1 No No 2
3 Yes No 1
So the combination No-No was most frequent, with two instances, with the
second most frequent being Yes-No, with one instance.
Well, how is this accomplished? It looks fairly complicated, but actually the
work is made pretty easy by a trick, exploiting the fact that you can present tables in
data frame format. Let’s use our cttab table again.
Note that this is not the original data frame ct from which the table cttab was
constructed. It is simply a different presentation of the table itself. There is one row
for each combination of the factors, with a Freq column added to show the number
of instances of each combination. This latter feature makes our task quite easy.
The sorting step, which makes use of order(), is the standard way to sort a
data frame. The key idea of the approach taken here is converting a table to a
data frame.
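A sketch of how tabdom() might be written, following the description above:
convert the table to a data frame, sort on the Freq column with order(), and keep
the first k rows. The function body is a reconstruction under those assumptions,
not necessarily the original listing.

```r
# Sketch of tabdom(): report the k most frequent cells of a table
tabdom <- function(tbl, k) {
  tbldf <- as.data.frame(tbl)                       # one row per cell, plus Freq
  freqord <- order(tbldf$Freq, decreasing = TRUE)   # largest counts first
  tbldf[freqord, ][1:k, ]                           # keep the top k rows
}

# Voting table rebuilt in memory from the five rows of ct.dat
ct <- data.frame(
  Vote.for.X = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)

as.data.frame(cttab)   # the table in data frame format
dom <- tabdom(cttab, 2)
dom
```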
4.3 Math Functions
R contains built-in functions for the math operations and, of course, for
statistical distributions.
R includes an extensive set of built-in math functions. Here is a partial list:
• exp(): Exponential function, base e
• log(): Natural logarithm
• log10(): Logarithm base 10
• sqrt(): Square root
• abs(): Absolute value
• sin(), cos(), and so on: Trig functions
• min() and max(): Minimum value and maximum value within a vector
• which.min() and which.max(): Index of the minimal element and maximal
element of a vector
• pmin() and pmax(): Element-wise minima and maxima of several vectors
• sum() and prod(): Sum and product of the elements of a vector
• cumsum() and cumprod(): Cumulative sum and product of the elements of a
vector
• round(), floor(), and ceiling(): Round to the closest integer, to the closest
integer below, and to the closest integer above
• factorial(): Factorial function.
4.3.1 Calculating a Probability
Calculating a probability using the prod() function. Suppose we have n
independent events, and the i-th event has the probability p_i of occurring. What is
the probability of exactly one of these events occurring?
Suppose first that n = 3 and our events are named A, B, and C. Then we
break down the computation as follows:
P(exactly one event occurs) = P(A and not B and not C) + P(not A and B and
not C) + P(not A and not B and C)
P(A and not B and not C) would be p_A(1 − p_B)(1 − p_C), and so on. For
general n, that is calculated as follows:
$$\sum_{i=1}^{n} p_i (1-p_1) \cdots (1-p_{i-1})(1-p_{i+1}) \cdots (1-p_n)$$
(The i-th term inside the sum is the probability that event i occurs and all the
others do not occur.)
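The computation can be sketched with prod(). The function name exactlyone and the
probability vectors are assumptions chosen for illustration.

```r
# Probability that exactly one of n independent events occurs.
# p is the vector of the individual event probabilities.
exactlyone <- function(p) {
  notp <- 1 - p
  tot <- 0
  for (i in seq_along(p)) {
    # event i occurs and all the others do not
    tot <- tot + p[i] * prod(notp[-i])
  }
  tot
}

exactlyone(c(0.5, 0.5))        # 0.5
exactlyone(c(0.2, 0.3, 0.5))   # 0.07 + 0.12 + 0.28 = 0.47
```

Here notp[-i] drops element i, so prod(notp[-i]) is the probability that none of
the other events occur.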
4.3.2 Cumulative Sums and Products
The functions cumsum() and cumprod() return cumulative sums and products.
Cumulative Products
cumprod() function in R Language is used to calculate the cumulative
product of the vector passed as argument.
Syntax: cumprod(x)
Parameters:
x: Numeric Object
Example 1:
# R program to illustrate
# the use of cumprod() Function
# Calling cumprod() Function
cumprod(1:4)
cumprod(-1:-6)
Output:
[1]  1  2  6 24
[1]   -1    2   -6   24 -120  720
Cumulative sum
The cumsum() function in R computes the cumulative sum of elements in a
vector object.
Syntax
cumsum(x)
Example
# using the cumsum() function to take running sums of the elements of a vector
cumsum(1:10)
cumsum(c(2, 3, 1, -4, 2))
4.3.3 Minima and Maxima
There is quite a difference between min() and pmin(). The former simply
combines all its arguments into one long vector and returns the minimum value in
that vector. In contrast, if pmin() is applied to two or more vectors, it returns a
vector of the pair-wise minima, hence the name pmin. Here’s an example:
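# illustrative vectors (assumed values; the originals were not shown)
numbers <- c(2, 4, 6, 8, 10)
characters <- c("s", "a", "p", "k")
min(numbers) # 2
min(characters) # "a"
Output
[1] 2
[1] "a"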
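The contrast between min() and pmin() described above can be sketched with two
short vectors. The values are illustrative.

```r
a <- c(1, 5, 6)
b <- c(2, 3, 2)

overall_min  <- min(a, b)    # combines a and b, single minimum: 1
pairwise_min <- pmin(a, b)   # element-wise minima: 1 3 2
pairwise_max <- pmax(a, b)   # element-wise maxima: 2 5 6

overall_min
pairwise_min
pairwise_max
```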
4.3.4 Calculus
Calculus is the branch of mathematics concerned with the study of rates of
change and of continuous change. Before calculus was invented, mathematics was
largely static: it could only help calculate objects that were perfectly still.
Calculus is also known as infinitesimal calculus. Derivatives and integrals are
its two most important ideas. The derivative measures the rate of change of a
function at a given point, while the integral measures the area under the curve,
accumulating the values of a function over an interval. The two types of calculus
are:
Differential Calculus
Differential calculus deals with determining the rate of change of a quantity
with respect to other variables. Derivatives are used to find the maxima and
minima of a function in order to find the best solution. The analysis of the limit
of a quotient leads to differential calculus. It is concerned with variables such
as x and y, functions f(x), and the resulting variations in x and y. Differentials
are represented by the symbols dy and dx. Differentiation refers to the method of
determining derivatives. A function’s derivative is written dy/dx or f’(x), which
denotes the derivative of y with respect to x.
In R programming, the derivative of a function can be computed using the
deriv() and D() functions. They compute derivatives of simple expressions.
Syntax:
deriv(expr,name)
D(expr, name)
Parameters:
expr: represents an expression or a formula with no LHS
name: represents a character vector naming the variable(s) with respect to which derivatives will be computed
Example:
# Expression or formula
f = expression(x^2 + 5*x + 1)
# Derivative
cat("Using deriv() function:\n")
print(deriv(f, "x"))
cat("\nUsing D() function:\n")
print(D(f, 'x'))
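As a brief sketch of how the result of D() can be used, the derivative expression
can be evaluated at a chosen point with eval(). The point x = 2 is an arbitrary
choice for illustration.

```r
# Derivative of x^2 + 5x + 1 with respect to x
dfdx <- D(quote(x^2 + 5 * x + 1), "x")
dfdx

# Evaluate the derivative at x = 2: 2*2 + 5 = 9
slope_at_2 <- eval(dfdx, list(x = 2))
slope_at_2
```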
Integral Calculus
Applications of Calculus
Examining a system to discover the best approach to forecast any given circumstance
for a function.
Calculus concepts are widely used in everyday life, whether it is to solve problems
with complex shapes, assess survey results, determine the safety of automobiles,
design a business, track credit card payments, or determine how a system is
developing and how it affects us, etc.
Economists, biologists, architects, doctors, and statisticians all speak calculus. For
instance, engineers and architects employ several calculus ideas to determine the size
and design of construction structures.
Modeling ideas like occurrence and mortality rates, radioactive decay, reaction rates,
heat and light, motion, and electricity all employ calculus.
For a continuous distribution (like the normal), the most useful functions for doing
problems involving probability calculations are the "p" and "q" functions (c. d. f. and
inverse c. d. f.), because the density (p. d. f.) calculated by the "d" function can only be
used to calculate probabilities via integrals. R can integrate numerically with integrate(),
but the "p" functions give the same probabilities directly.
For a discrete distribution (like the binomial), the "d" function calculates the density (p. f.),
which in this case is a probability
f(x) = P(X = x)
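A short sketch of the "p", "q", and "d" functions for the normal and binomial
distributions. The particular arguments are chosen for illustration.

```r
p <- pnorm(1.96)                        # c.d.f.: P(Z <= 1.96), standard normal
q <- qnorm(0.975)                       # inverse c.d.f.: about 1.96
pr <- dbinom(2, size = 5, prob = 0.5)   # P(X = 2) for Binomial(5, 0.5)

# The same normal probability obtained by integrating the density numerically
area <- integrate(dnorm, -Inf, 1.96)$value

p
q
pr     # 10/32 = 0.3125
area   # agrees with p up to numerical error
```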
R has functions to handle many probability distributions. The table below gives the names
of the functions for each distribution. The online documentation is the authoritative
reference for how the functions are used, but it is best to try some examples first.
Distribution   p (c.d.f.)   q (inverse c.d.f.)   d (density)   r (random)
F              pf           qf                   df            rf
Student t      pt           qt                   dt            rt
UNIT V
OBJECT-ORIENTED PROGRAMMING
S Classes
Class System in R
While most programming languages have a single class system, R has three class systems:
S3 Class
S4 Class
Reference Class
The original R structure for classes, known as S3, is still the dominant class paradigm in R
use today. Indeed, most of R’s own built-in classes are of the S3 type.
An S3 class consists of a list, with a class name attribute and dispatch capability added.
S3 Class in R
S3 class is the most popular class in the R programming language. Most of the classes that
come predefined in R are of this type.
First we create a list with various components, then we assign it a class using
the class() function. For example, we can create a list named student1 with three
components and then set its class with class(student1) <- "Student_Info".
Here, Student_Info is the name of the class. To create an object of this class, we
assign the class name to the student1 list through class().
As a result, student1 is now an object of the Student_Info class.
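The steps above can be sketched as follows. The component names and values
(name, age, GPA) are assumptions chosen for illustration.

```r
# A list with three components (names and values are illustrative)
student1 <- list(name = "John", age = 21, GPA = 3.5)

# Setting the class attribute turns the list into an S3 object
class(student1) <- "Student_Info"

class(student1)                      # "Student_Info"
inherits(student1, "Student_Info")   # TRUE
student1$name                        # components are still accessed with $
```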
S4 Class in R
S4 class is an improvement over the S3 class. They have a formally defined structure
which helps in making objects of the same class look more or less similar.
In R, we use the setClass() function to define a class. For example,
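A minimal sketch of defining and instantiating an S4 class. The class name and
slots mirror the S3 example and are assumptions for illustration.

```r
# Define the class with typed slots
setClass("Student_Info",
         representation(name = "character", age = "numeric", GPA = "numeric"))

# Create an object with new(); slot types are checked at creation
student1 <- new("Student_Info", name = "John", age = 21, GPA = 3.5)

isS4(student1)   # TRUE
student1@name    # slots are accessed with @
student1@GPA
```

Supplying a value of the wrong type for a slot, such as a character string for
age, makes new() signal an error. This is the formal structure S3 lacks.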
Reference Class in R
Reference classes were introduced later, compared to the other two. They are more
similar to the object-oriented programming we are used to seeing in other major
programming languages.
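A small sketch of a reference class created with setRefClass(). The class name,
fields, and method are assumptions for illustration.

```r
# Define a reference class with two fields and one method
Student <- setRefClass("Student",
  fields = list(name = "character", age = "numeric"),
  methods = list(
    birthday = function() {
      age <<- age + 1    # modifies the object in place
      invisible(.self)
    }
  )
)

s <- Student$new(name = "John", age = 21)
s$birthday()
s$age   # 22: the object itself changed, no reassignment needed
```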
Here, we printed out the object lmout. (Remember that by simply typing the name
of an object in interactive mode, the object is printed.) The R interpreter then saw that
lmout was an object of class "lm" and thus called print.lm(), a special print method for the
"lm" class. In R terminology, the call to the generic function print() was dispatched to the
method print.lm() associated with the class "lm". Let’s take a look at the generic
function and the class method in this case:
You may be surprised to see that print() consists solely of a call to UseMethod(). But
this is actually the dispatcher function, so in view of print()’s role as a generic function,
you should not be surprised after all.
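The dispatch described above can be observed with a small regression. The data
values are assumptions for illustration.

```r
# Fit a tiny regression (illustrative data)
x <- c(1, 2, 3)
y <- c(1, 3, 8)
lmout <- lm(y ~ x)

class(lmout)   # "lm"
lmout          # typing the name calls print(), which dispatches to print.lm()

print          # the generic: its body is just UseMethod("print")
print_lm <- getS3method("print", "lm")   # the class-specific method
is.function(print_lm)
```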
Writing S3 Classes
S3 classes have a rather cobbled-together structure. A class instance is created by
forming a list, with the components of the list being the member variables of the class. The
"class" attribute is set by hand by using the attr() or class() function, and then various
implementations of generic functions are defined. We can see this in the case of lm() by
inspecting the function:
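The same pattern can be sketched for a user-defined class. The "employee" class
and its components below are assumptions, chosen to match the names used in the
inheritance discussion.

```r
# Create an instance: a list whose class attribute is set by hand
j <- list(name = "Joe", salary = 55000, union = TRUE)
class(j) <- "employee"

# Implementation of the generic print() for class "employee"
print.employee <- function(x, ...) {
  cat(x$name, "\n")
  cat("salary", x$salary, "\n")
  cat("union member", x$union, "\n")
  invisible(x)
}

j   # dispatches to print.employee()
```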
Using Inheritance
The idea of inheritance is to form new classes as specialized versions of old ones. In
our previous employee example, for instance, we could form a new class devoted to hourly
employees, "hrlyemployee", as a subclass of "employee", as follows:
Our new class has one extra variable: hrsthismonth. The name of the new class
consists of two character strings, representing the new class and the old class. Our new
class inherits the methods of the old one. For instance, print.employee() still works on the
new class:
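The inheritance described above can be sketched as follows. The component values
and the body of print.employee() are assumptions for illustration.

```r
# Print method for the parent class
print.employee <- function(x, ...) {
  cat(x$name, "\n")
  cat("salary", x$salary, "\n")
  cat("union member", x$union, "\n")
  invisible(x)
}

# The subclass has one extra variable; the class attribute lists both names
k <- list(name = "Kate", salary = 68000, union = FALSE, hrsthismonth = 2)
class(k) <- c("hrlyemployee", "employee")

k   # no print.hrlyemployee() exists, so print.employee() is used
inherits(k, "employee")   # TRUE
```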
Once again, simply typing k resulted in the call print(k). In turn, that caused
UseMethod() to search for a print method on the first of k’s two class names,
"hrlyemployee". That search failed, so UseMethod() tried the other class name,
"employee", and found print.employee(). It executed the latter. Recall that in inspecting the
code for "lm", you saw this line:
Implementing a Generic Function on an S4 Class
Since joe is an S4 object, the action here is that show() is called. In fact, we would get the
same output by typing this:
The first argument gives the name of the generic function for which we will define a class-
specific method, and the second argument gives the class name. We then define the new
function.
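A sketch of defining a show() method with setMethod(). The S4 class "employee",
its slots, and the message format are assumptions for illustration.

```r
# An S4 class and an instance (slots are illustrative)
setClass("employee",
         representation(name = "character", salary = "numeric", union = "logical"))
joe <- new("employee", name = "Joe", salary = 55000, union = TRUE)

# First argument: the generic's name; second: the class; then the new function
setMethod("show", "employee", function(object) {
  inorout <- ifelse(object@union, "is", "is not")
  cat(object@name, "has a salary of", object@salary,
      "and", inorout, "in the union\n")
})

joe         # typing the name calls show()
show(joe)   # same output
```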
S3 Vs S4
The S3 and S4 software in R are two generations implementing functional object-
oriented programming. S3 is the original, simpler for initial programming but less general,
less formal and less open to validation. The S4 formal methods and classes provide these
features but require more programming.
Visualization
Data visualization is the technique used to deliver insights from data using visual cues
such as graphs, charts, and maps. It makes large quantities of data easier to understand
intuitively and thereby helps in making better decisions based on that data.
Data Visualization in R Programming Language
The popular data visualization tools that are available are Tableau, Plotly, R,
Google Charts, Infogram, and Kibana. The various data visualization platforms have
different capabilities, functionality, and use cases. They also require a different skill set.
This article discusses the use of R for data visualization.
R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualization as it offers flexibility and
minimum required coding through its packages.
Types of Data Visualizations
Some of the various types of visualizations offered by R are:
Bar Plot
There are two types of bar plots- horizontal and vertical which represent data points
as horizontal or vertical bars of certain lengths proportional to the value of the data item.
They are generally used for plotting continuous values across categories. Setting
the horiz parameter to TRUE gives a horizontal bar plot, and FALSE (the default) gives a
vertical one.
Bar plots are used for the following scenarios:
1.To perform a comparative study between the various data categories in the data set.
2.To analyze the change of a variable over time in months or years.
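A sketch of both orientations using the built-in mtcars dataset. The titles and
labels are arbitrary choices.

```r
pdf(NULL)   # null graphics device; remove this line and dev.off() to see the plots

counts <- table(mtcars$cyl)   # number of cars per cylinder count

# Vertical bars (horiz = FALSE is the default)
mids_v <- barplot(counts, main = "Cars by cylinders",
                  xlab = "Cylinders", ylab = "Count")

# Horizontal bars
mids_h <- barplot(counts, horiz = TRUE, main = "Cars by cylinders",
                  xlab = "Count", ylab = "Cylinders")

invisible(dev.off())
```

barplot() returns the bar midpoints, which is useful for adding labels afterwards.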
Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data
distribution. However, in a histogram values are grouped into consecutive intervals called
bins. In a Histogram, continuous values are grouped and displayed in these bins whose size
can be varied.
For a histogram, the parameter xlim can be used to specify the interval within which
all values are to be displayed. When the parameter freq is set to TRUE, the y-axis shows
the frequency (count) of values in each bin; when it is set to FALSE, the y-axis shows
probability densities, scaled so that the total area of the histogram adds up to one.
Histograms are used in the following scenarios:
To verify an equal and symmetric distribution of the data.
To identify deviations from expected values.
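A sketch of the xlim and freq parameters using the built-in mtcars dataset. The
axis limits and labels are arbitrary choices.

```r
pdf(NULL)   # null graphics device; remove this line and dev.off() to see the plot

# freq = FALSE puts probability densities on the y-axis
h <- hist(mtcars$mpg, freq = FALSE, xlim = c(10, 35),
          main = "Miles per gallon", xlab = "mpg")

invisible(dev.off())

h$breaks                                       # bin boundaries
total_area <- sum(h$density * diff(h$breaks))  # bar areas add up to one
total_area
```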
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A
boxplot depicts information like the minimum and maximum data point, the median value,
first and third quartile, and interquartile range.
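A sketch of a grouped box plot using the built-in mtcars dataset. The labels are
arbitrary choices.

```r
pdf(NULL)   # null graphics device; remove this line and dev.off() to see the plot

# One box per cylinder group
b <- boxplot(mpg ~ cyl, data = mtcars,
             main = "Mileage by cylinders", xlab = "Cylinders", ylab = "mpg")

invisible(dev.off())

# Each column of b$stats: lower whisker, first quartile (hinge),
# median, third quartile (hinge), upper whisker
b$stats
b$names
```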
R is only preferred for data visualization when done on an individual standalone server.
Data visualization using R is slow for large amounts of data as compared to other
counterparts.
Application Areas:
Presenting analytical conclusions of the data to the non-analysts departments of your
company.
Health monitoring devices use data visualization to track any anomaly in blood pressure,
cholesterol and others.
Meteorologists use data visualization for assessing prevalent weather changes throughout
the world.
Simulation
Simulations are a powerful statistical tool. Simulation techniques allow us to carry out
statistical inference in complex models, estimate quantities that we cannot calculate
analytically, or even predict the outcome of some scenario, such as an epidemic outbreak,
under different conditions. In this section, we will cover the basics of simulations and
simulation experiments.
In the code below, we have a burnin period, where we discard the initial
observations to allow the markov chain time to get close to stationarity (the point where
iterations start to come from the desired distribution), and we thin (only keep every 10th
value) to reduce the correlation between our generated data.
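Burn-in and thinning can be sketched with a simple random-walk Metropolis sampler.
The target distribution (standard normal), the starting value, and all variable
names are assumptions for illustration, not the handout's original code.

```r
set.seed(1)
n_iter <- 20000   # total iterations
burnin <- 1000    # initial iterations to discard
thin   <- 10      # keep every 10th value after burn-in

chain <- numeric(n_iter)
x <- 5            # deliberately poor starting value
for (i in seq_len(n_iter)) {
  proposal <- x + rnorm(1)                  # symmetric random-walk proposal
  log_ratio <- dnorm(proposal, log = TRUE) - dnorm(x, log = TRUE)
  if (log(runif(1)) < log_ratio) x <- proposal   # accept, otherwise keep x
  chain[i] <- x
}

# Discard the burn-in period, then thin
kept <- chain[seq(burnin + 1, n_iter, by = thin)]
length(kept)
mean(kept)   # should be near 0 for a standard normal target
sd(kept)     # should be near 1
```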
Simulation Studies/Experiments
What is a simulation study?
A numerical technique for conducting experiments on a computer
It involves randomly sampling from probability distributions
Why conduct a simulation study?
To validate a statistical method so people can use it with confidence
Examine analytic properties that are rarely possible to calculate exactly
Check how large N properties behave in (finite) samples
Check how a statistical technique performs when the assumptions are not met
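As a sketch of the last point, the simulation below checks how often a nominal 95%
t-interval covers the true mean when the data are exponential rather than normal.
The sample size, distribution, and number of replications are assumptions.

```r
set.seed(42)
n_sims <- 2000   # number of simulated datasets
n <- 10          # small sample size

covered <- replicate(n_sims, {
  x <- rexp(n, rate = 1)        # skewed data; the true mean is 1
  ci <- t.test(x)$conf.int      # nominal 95% interval, assumes normality
  ci[1] <= 1 && 1 <= ci[2]      # did the interval cover the true mean?
})

coverage <- mean(covered)
coverage   # compare with the nominal 0.95
```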
Simulating an Epidemic from a SIR Model
Code profiling tools allow us to analyze the performance of code by measuring the time
its functions take to run and the amount of CPU and memory they consume.
Characteristic Features:
Tie every slow distributed trace to the methods and threads that executed the
request
Quickly detect and resolve anomalous spikes in infrastructure metrics caused by
inefficiencies in your code
Compare code behavior and impact across hosts, services, and versions during code
deployments
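In R, a simple first step toward profiling is timing code with system.time(). The
sketch below compares a loop with a vectorized equivalent. The function names are
assumptions for illustration.

```r
# Two ways to compute the same sum
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}
fast_sum <- function(n) sum(sqrt(seq_len(n)))

n <- 1e6
t_slow <- system.time(r1 <- slow_sum(n))
t_fast <- system.time(r2 <- fast_sum(n))

t_slow["elapsed"]    # seconds taken by the loop
t_fast["elapsed"]    # seconds taken by the vectorized version
all.equal(r1, r2)    # both give the same answer
```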
Statistical Analysis with R
Statistical analysis with R is common practice among statisticians, data
analysts, and data scientists who work with statistical data. R is a popular
open-source programming language with extensive built-in and external packages
for statistical analysis. It natively supports basic statistical calculations for
exploratory data analysis and advanced statistics for predictive data analysis.
Statistical analysis with R is an important part of identifying data patterns based upon
statistical rules and business constraints. R is preferred for statistical analysis
because of its simple syntax and the flexibility of its advanced packages.
Data Manipulation
Data manipulation involves modifying data to make it easier to read and to be more
organized. We manipulate data for analysis and visualization. It is also used with the term
‘data exploration’ which involves organizing data using available sets of variables.
At times, the data collection process done by machines involves a lot of errors and
inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and
make data more accurate and precise.
arrange() method
In R, the arrange() method is used to order the rows based on a specified column.
The syntax of arrange() method is specified below-
arrange(dataframeName, columnName)
Example:
In the below code we ordered the data based on the runs from low to high using arrange()
function.
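A sketch of arrange() from the dplyr package. The stats data frame and its values
are assumptions for illustration.

```r
library(dplyr)

# Illustrative data (assumed values)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

by_runs <- arrange(stats, runs)             # low to high
by_runs_desc <- arrange(stats, desc(runs))  # high to low
by_runs
```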
select() method
The select() method is used to extract the required columns as a table by specifying the
required column names in select() method. The syntax of select() method is mentioned
below-
select(dataframeName, col1,col2,…)
Example:
Here in the below code we fetched the player, wickets column data only using select()
method.
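A sketch of select() from the dplyr package. The stats data frame and its values
are assumptions for illustration.

```r
library(dplyr)

# Illustrative data (assumed values)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# Keep only the player and wickets columns
player_wickets <- select(stats, player, wickets)
player_wickets
```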
rename() method
The rename() function is used to change the column names. This can be done by the
below syntax-
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to “runs_scored” in stats data frame.
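A sketch of rename() from the dplyr package. The stats data frame and its values
are assumptions for illustration.

```r
library(dplyr)

# Illustrative data (assumed values)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# New name on the left, old name on the right
renamed <- rename(stats, runs_scored = runs)
names(renamed)
```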
mutate() & transmute() methods
These methods are used to create new variables. The mutate() function creates
new variables without dropping the old ones but transmute() function drops the old
variables and creates new variables. The syntax of both methods is mentioned below-
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
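A sketch contrasting mutate() and transmute() from the dplyr package. The stats
data frame and the new variable runs_per_wicket are assumptions for illustration.

```r
library(dplyr)

# Illustrative data (assumed values)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# mutate() keeps the existing columns and adds the new one
added <- mutate(stats, runs_per_wicket = runs / wickets)
names(added)

# transmute() keeps only the newly created column
only_new <- transmute(stats, runs_per_wicket = runs / wickets)
names(only_new)
```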
summarize() method
Using the summarize method we can summarize the data in the data frame by using
aggregate functions like sum(), mean(), etc. The syntax of summarize() method is
specified below-
summarize(dataframeName, aggregate_function(columnName))
Example:
In the below code we presented the summarized data present in the runs column using
summarize() method.
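A sketch of summarize() from the dplyr package. The stats data frame, its values,
and the output column names are assumptions for illustration.

```r
library(dplyr)

# Illustrative data (assumed values)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# Collapse the runs column to one row of aggregate values
run_summary <- summarize(stats, total_runs = sum(runs), mean_runs = mean(runs))
run_summary
```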