Introduction To Analytics and R-Programming
Introduction to R
► The R console is the most important tool for using R. The R console is
a tool that allows you to type commands into R and see how the R
system responds. The commands that you type into the console are
called expressions. A part of the R system called the interpreter will
read the expressions and respond with a result or an error message.
Sometimes, you can also enter an expression into R through the
menus.
► By default, R will display a greater-than sign (“>”) in the console (at
the beginning of a line, when nothing else is shown) when R is
waiting for you to enter a command into the console. R is
prompting you to type something, so this is called a prompt.
Everything in R is an object.
R has 6 basic data types. (In addition to the five listed below, there is
also raw which will not be discussed in this workshop.)
► character
► numeric (real or decimal)
► integer
► logical
► complex
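As a quick check of the types listed above, the sketch below builds one value of each type and asks R for its class. The variable names are chosen only for illustration.

```r
# One illustrative value per basic data type (names are hypothetical).
chr_val <- "hello"      # character
num_val <- 3.14         # numeric (real or decimal)
int_val <- 42L          # integer: the L suffix requests an integer
lgl_val <- TRUE         # logical
cpl_val <- 2 + 3i       # complex

# class() reports the data type of each object.
types <- c(class(chr_val), class(num_val), class(int_val),
           class(lgl_val), class(cpl_val))
print(types)
```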
Data structures in R
► Atomic vectors come in six data types, also called the six classes of
vectors. The other R objects are built upon the atomic vectors.
Basic Functions in R console
► Creating an R file
► Saving an R file
► Clearing the Console: The console can be cleared using the shortcut key “Ctrl + L”.
Execution of an R file:
There are several ways to execute the commands in an R file.
► Using the run command: This “run” command can be executed using the GUI, by
pressing the run button there, or you can use the shortcut key Ctrl + Enter.
► What does it do? It executes the line on which the cursor is placed.
► Using the source with echo command: This “source with echo” command can be
executed using the GUI, by pressing the source with echo button there, or you can
use the shortcut key Ctrl + Shift + Enter.
► What does it do? It runs the whole file and prints each command along with
its output.
► Clearing the Environment: Variables on the R environment can be cleared in
two ways:
► Using rm() command: When you want to clear a single variable from the R
environment you can use the “rm()” command with the name of the variable you
want to remove.
► Typing rm(variable) will delete that variable. If you want to delete all
the variables in the environment, pass the result of ls() as the “list”
argument: rm(list = ls()).
► Using the GUI: We can also clear all the variables in the environment using
the GUI in the environment pane by using the brush button.
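The two rm() forms can be sketched as follows. To keep the demonstration from wiping a real workspace, it works inside a scratch environment; the name scratch and the variables a, b, and d are assumptions made for illustration. At the console you would normally just type rm(a) or rm(list = ls()).

```r
# A scratch environment stands in for the workspace (hypothetical name).
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)
assign("d", 3, envir = scratch)

# Remove a single variable by name.
rm("a", envir = scratch)
after_single <- ls(envir = scratch)

# Remove everything that is left: pass the output of ls() as the list argument.
rm(list = ls(envir = scratch), envir = scratch)
after_all <- ls(envir = scratch)

print(after_single)
print(after_all)
```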
Run command over Source command:
► Like most other languages, R lets you assign values to variables and refer to them
by name.
► In R, the assignment operator is <-. Usually, this is pronounced as “gets.” For
example, the statement:
x <- 1
is usually read as “x gets 1.”
► After you assign a value to a variable, the R interpreter will substitute that value in
place of the variable name when it evaluates an expression.
Comments in R
Keywords are words reserved by the language because they have a special
meaning, so a keyword can’t be used as a variable name, function name, etc. We
can view these keywords by using either help(reserved) or ?reserved.
► Here’s a simple example:
> x <- 1
> y <- 2
> z <- c(x,y)
> z
[1] 1 2
Notice that the substitution is done at the time that the value is
assigned to z, not at the time that z is evaluated.
Suppose that you were to type in the preceding three expressions and
then change the value of y. The value of z would not change:
> y <- 4
> z
[1] 1 2
Now try assigning a new value to z and check: does the previous value
still show up?
► R provides several different ways to refer to a member (or set of members) of
a vector.
► You can refer to elements by location in a vector:
> b <- c(1,2,3,4,5,6,7,8,9,10,11,12)
> b
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> b[7]
[1] 7
> b[1:6]
[1] 1 2 3 4 5 6
> b[b %% 3 == 0]
[1] 3 6 9 12
Exercise
That’s the basic interface for executing R code in RStudio. Try doing these
simple tasks. If you execute everything correctly, you should end up with the
same number that you started with:
► Choose any number and add 2 to it.
► Multiply the result by 3.
► Subtract 6 from the answer.
► Divide what you get by 3.
Variables can be assigned values using the leftward (<-), rightward (->), and
equal-to (=) operators. The values of the variables can be printed using the
print() or cat() function.
► Variables can be deleted by using the rm() function. Let us delete the
variable var.3.
► Printing the value of the variable after it has been deleted throws an error.
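A minimal sketch of the three assignment operators, the two printing functions, and rm(). The name var.3 comes from the text above; var.1, var.2, and all the values are assumptions made for illustration.

```r
# Three equivalent ways to assign a value.
var.1 <- c(0, 1, 2, 3)        # leftward assignment
c("learn", "R") -> var.2      # rightward assignment
var.3 = c(TRUE, FALSE)        # equal-to assignment

print(var.1)                  # print() shows one object
cat("var.2 is", var.2, "\n")  # cat() writes its arguments on one line
cat("var.3 is", var.3, "\n")

# Delete var.3; after this, print(var.3) would stop with an
# "object 'var.3' not found" error.
rm(var.3)
still_there <- exists("var.3", inherits = FALSE)
print(still_there)
```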
Basic Operations in R
► When you perform an operation on two vectors, R will match the elements of the
two vectors pairwise and return a vector. For example:
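A small sketch of pairwise matching, using two vectors of equal length whose names and values are chosen only for illustration:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

vec_sum  <- a + b    # element 1 with element 1, element 2 with element 2, ...
vec_prod <- a * b    # the same pairing applies to multiplication
vec_gt   <- b > 15   # comparisons are also applied element by element

print(vec_sum)
print(vec_prod)
print(vec_gt)
```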
► When you want to create vector with more than one element, you
should use c() function which means to combine the elements into
a vector.
apple <- c('red','green',"yellow")
apple
class(apple)
► When we execute the above code, it produces the following result:
[1] "red" "green" "yellow"
[1] "character"
Types of vectors
► Vectors are of different types which are used in R. Following are some of
the types of vectors:
► Numeric vectors: Numeric vectors are those which contain numeric
values such as integers and doubles (decimal numbers).
► Character vectors: Character vectors contain alphanumeric values and
special characters.
► Logical vectors: Logical vectors contain boolean values such as TRUE
and FALSE, plus NA for missing values.
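The three vector types above can be checked with class(). This is a small sketch; the vector names and contents are illustrative.

```r
num_vec <- c(1.5, 2, 3)          # numeric vector
chr_vec <- c("a", "b2", "#")     # character vector
lgl_vec <- c(TRUE, FALSE, NA)    # logical vector with one missing value

vec_classes <- c(class(num_vec), class(chr_vec), class(lgl_vec))
missing_flags <- is.na(lgl_vec)  # marks which elements are NA

print(vec_classes)
print(missing_flags)
```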
Creating a vector
► There are different ways of creating vectors. Generally, we use ‘c’ to
combine different elements together.
# we can use the c function
# to combine the values as a vector.
# By default the type will be double
X <- c(61, 4, 21, 67, 89, 2)
cat('using c function', X, '\n')
# seq() function for creating
# a sequence of continuous values.
# length.out defines the length of vector.
Y <- seq(1, 10, length.out = 5)
cat('using seq() function', Y, '\n')
# use ':' to create a vector
# of continuous values.
Z <- 2:7
cat('using colon', Z)
Accessing vector elements
► The sort() function sorts the values in ascending or descending order.
# Creation of Vector
X <- c(8, 2, 7, 1, 11, 2)
# Sort in ascending order
A <- sort(X)
cat('ascending order', A, '\n')
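A short sketch of accessing elements by position and of sorting in descending order, using the same values as the vector X above. The result names are illustrative.

```r
X <- c(8, 2, 7, 1, 11, 2)

third     <- X[3]          # single element by position
first_two <- X[c(1, 2)]    # several positions at once
no_first  <- X[-1]         # a negative index drops that position

# decreasing = TRUE reverses the default ascending order.
D <- sort(X, decreasing = TRUE)

cat('third element', third, '\n')
cat('descending order', D, '\n')
```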
R List to matrix
► We will create matrices using matrix() function in R programming.
Another function that will be used is unlist() function to convert
the lists into a vector.
# Defining list
lst1 <- list(list(1, 2, 3),
list(4, 5, 6))
# Print list
cat("The list is:\n")
print(lst1)
cat("Class:", class(lst1), "\n")
# Convert list to matrix
mat <- matrix(unlist(lst1), nrow = 2, byrow = TRUE)
# Print matrix
cat("\nAfter conversion to matrix:\n")
print(mat)
cat("Class:", class(mat), "\n")
Matrices
► Matrix where all rows and columns are filled by a single constant ‘k’:
To create such a matrix the syntax is given below:
Syntax: matrix(k, m, n)
Parameters:
k: the constant
m: no of rows
n: no of columns
# R program to illustrate
# special matrices
# Matrix having 3 rows and 3 columns
# filled by a single constant 5
print(matrix(5, 3, 3))
Diagonal matrix:
► A diagonal matrix is a matrix in which the entries outside the main diagonal
are all zero. To create such a matrix the syntax is given below:
Syntax: diag(k, m, n)
Parameters:
k: the constants/array
m: no of rows
n: no of columns
# R program to illustrate
# special matrices
# Diagonal matrix having 3 rows and 3 columns
# filled by array of elements (5, 3, 3)
print(diag(c(5, 3, 3), 3, 3))
Identity matrix:
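An identity matrix has ones on the main diagonal and zeros elsewhere. One way to build it, sketched here under the assumption that a 3 x 3 example is wanted, is to call diag() with a single number. The names I3 and M are illustrative.

```r
# diag(n) with a single number n returns the n x n identity matrix.
I3 <- diag(3)
print(I3)

# Multiplying a matrix by the identity (matrix product %*%) leaves it unchanged.
M <- matrix(1:9, nrow = 3, byrow = TRUE)
unchanged <- all(M %*% I3 == M)
print(unchanged)
```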
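# R program to access elements of a matrix
# Matrix holding the values 1 to 9, filled row by row
A = matrix(1:9, nrow = 3, byrow = TRUE)
# Accessing 2
print(A[1, 2])
# Accessing 6
print(A[2, 3])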
Accessing Submatrices:
# R program to illustrate
# access submatrices in a matrix
A = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
cat("Before deleting the 2nd column\n")
print(A)
# 2nd-column deletion
A = A[, -2]
cat("After deleting the 2nd column\n")
print(A)
Matrix addition:
► Commutative: B + C = C + B
► Associative: For three matrices, A + (B + C) = (A + B) + C
► The matrices involved must have the same order (dimensions).
Matrix subtraction:
► Non-Commutative: B – C != C – B
► Non-Associative: For three matrices, A – (B – C) != (A – B) – C
► The matrices involved must have the same order (dimensions).
Matrix multiplication (element-wise, with the * operator):
► Commutative: B * C = C * B
► Associative: For three matrices, A * (B * C) = (A * B) * C
► The matrices involved must have the same order (dimensions).
Matrix division (element-wise, with the / operator):
► Non-Commutative: B / C != C / B
► Non-Associative: For three matrices, A / (B / C) != (A / B) / C
► The matrices involved must have the same order (dimensions).
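The four operations can be sketched on two small matrices of the same order. The matrices B and C and their values are assumptions made for illustration; note that * and / work element by element, which is different from the matrix product %*%.

```r
# Two 2 x 2 matrices (filled column by column).
B <- matrix(c(1, 2, 3, 4), nrow = 2)
C <- matrix(c(5, 6, 7, 8), nrow = 2)

add_BC <- B + C    # element-wise sum
sub_BC <- B - C    # element-wise difference
mul_BC <- B * C    # element-wise product
div_BC <- B / C    # element-wise quotient

print(add_BC)
print(sub_BC)
print(mul_BC)
print(div_BC)

# Check the commutativity claims on this example.
commutes_add <- all(B + C == C + B)
commutes_sub <- all(B - C == C - B)
```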
Arrays
► Arrays are data storage structures defined by a fixed number of dimensions,
with their elements stored at contiguous memory locations.
One-dimensional arrays are called vectors, with the length being their only dimension.
Two-dimensional arrays are called matrices, consisting of fixed numbers of rows and
columns. All elements of an array have the same data type. The array() function
takes vectors as input and arranges their values according to the requested
dimensions.
► While matrices are confined to two dimensions, arrays can be of any number of
dimensions. The array function takes a dim attribute which creates the required
number of dimensions. In the below example we create an array with two elements
which are 3x3 matrices each.
a <- array(c('green','yellow'),dim=c(3,3,2))
print(a)
Creating an Array
# a one-dimensional array (vector); the values 1 to 10 are illustrative
vec <- c(1:10)
# accessing elements
cat("Third element of vector is : ", vec[3])
Accessing entire matrices
vec1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
vec2 <- c(10, 11, 12)
row_names <- c("row1", "row2")
col_names <- c("col1", "col2", "col3")
mat_names <- c("Mat1", "Mat2")
arr = array(c(vec1, vec2), dim = c(2, 3, 2),
dimnames = list(row_names,
col_names, mat_names))
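Once the array is built, a whole matrix is selected by leaving the row and column positions empty and giving only the third index. The sketch below rebuilds the same array so that it runs on its own; the result names mat1, mat2, and elem are illustrative.

```r
vec1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
vec2 <- c(10, 11, 12)
arr <- array(c(vec1, vec2), dim = c(2, 3, 2),
             dimnames = list(c("row1", "row2"),
                             c("col1", "col2", "col3"),
                             c("Mat1", "Mat2")))

mat1 <- arr[, , 1]         # first matrix, selected by position
mat2 <- arr[, , "Mat2"]    # second matrix, selected by name
elem <- arr[1, 2, 1]       # row 1, column 2 of the first matrix

print(mat1)
print(mat2)
print(elem)
```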
► Factors are R objects which are created using a vector. A factor stores the vector
along with the distinct values of the elements in the vector as labels.
► The labels are always stored as character strings, whether the input vector is
numeric, character, logical, etc. They are useful in statistical modeling.
► Factors are created using the factor() function. The nlevels() function gives the
count of levels.
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)
factor_apple
nlevels(factor_apple)
► Factors in R Programming Language are data structures that
are implemented to categorize the data or represent
categorical data and store it on multiple levels.
► They can be stored as integers with a corresponding label to
every unique integer. Though factors may look similar to
character vectors, they are integers and care must be taken
while using them as strings. The factor accepts only a restricted
number of distinct values. For example, a data field such as
gender may contain values only from female, male, or
transgender.
► In the above example, all the possible cases are known
beforehand and are predefined. These distinct values
are known as levels. After a factor is created it only
consists of levels that are by default sorted
alphabetically.
Attributes of Factors in R Language
# Creating a vector
x <- c("female", "male", "male", "female")
print(x)
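Continuing from the vector above, the following sketch turns it into a factor and inspects its main attributes. The name gender is an assumption made for illustration.

```r
x <- c("female", "male", "male", "female")

# Convert the character vector into a factor.
gender <- factor(x)

gender_levels <- levels(gender)      # the distinct labels, sorted alphabetically
gender_codes  <- as.integer(gender)  # the integer code stored for each element
gender_count  <- nlevels(gender)     # number of levels

print(gender)
print(gender_levels)
print(gender_codes)
```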
► After a factor is formed, its components can be modified, but the new values
must be among the predefined levels.
gender <- factor(c("female", "male", "male", "female"))
gender[2]<-"female"
gender
► To select all the elements of the factor gender except the ith element, use
gender[-i]. If you want to modify a factor and add a value outside the
predefined levels, first modify the levels.
gender <- factor(c("female", "male", "male", "female"))
# add new level
levels(gender) <- c(levels(gender), "other")
gender[3] <- "other"
gender
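To see why the levels must be extended first, the sketch below assigns a value that is not a level. R does not stop with an error here: it issues a warning and stores NA instead. The name gender_bad is hypothetical, and suppressWarnings() is used only to keep the output quiet.

```r
gender <- factor(c("female", "male", "male", "female"))

# Assigning a value outside the levels produces NA (with a warning).
gender_bad <- gender
suppressWarnings(gender_bad[2] <- "other")
became_na <- is.na(gender_bad[2])

# Negative indexing: everything except the 2nd element.
without_second <- gender[-2]

print(gender_bad)
print(without_second)
```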
Factors in Data Frame
The Data frame is similar to a 2D array with the columns containing all
the values of one variable and the rows having one set of values from
every column. There are four things to remember about data frames:
► Column names are compulsory and cannot be empty.
► Unique names should be assigned to each row.
► The columns of a data frame are most commonly of factor, numeric,
or character type, although other types such as logical are also allowed.
► The same number of data items must be present in each column.
► In versions of R before 4.0.0, character columns were converted to factors
automatically when a data frame was created. Since R 4.0.0 the default is
stringsAsFactors = FALSE, so the conversion has to be requested explicitly.
We can create a data frame and check if its column is a factor.
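A small sketch of that check. To avoid depending on the R version, stringsAsFactors is set explicitly in both calls; the data frame names and values are illustrative.

```r
# Character column kept as character.
df_plain <- data.frame(gender = c("Male", "Male", "Female"),
                       height = c(152, 171.5, 165),
                       stringsAsFactors = FALSE)

# Character column converted to a factor.
df_factor <- data.frame(gender = c("Male", "Male", "Female"),
                        height = c(152, 171.5, 165),
                        stringsAsFactors = TRUE)

plain_is_factor  <- is.factor(df_plain$gender)
factor_is_factor <- is.factor(df_factor$gender)

print(plain_is_factor)
print(factor_is_factor)
print(levels(df_factor$gender))
```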
► Data frames are tabular data objects. Unlike in a matrix, each column of a data
frame can contain a different mode of data. The first column can be numeric while the
second column can be character and the third column can be logical. A data frame is a
list of vectors of equal length. Data frames are created using the data.frame() function.
BMI <- data.frame( gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age =c(42,38,26)
)
print(BMI)
Create Dataframe in R Programming Language
► To create a data frame in R use data.frame() command and then pass
each of the vectors you have created as arguments to the function.
# R program to create dataframe
# creating a data frame
friend.data <- data.frame(
friend_id = c(1:5),
friend_name = c("Sachin", "Sourav",
"Dravid", "Sehwag",
"Dhoni"),
stringsAsFactors = FALSE
)
# print the data frame
print(friend.data)
Get the Structure of the R – Data Frame
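One common way to inspect the structure of a data frame is the str() function, which lists each column with its type and first values. This is a minimal sketch that assumes the friend.data example; dim() and sapply() are added only to show the same information as plain values.

```r
friend.data <- data.frame(
  friend_id = c(1:5),
  friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni"),
  stringsAsFactors = FALSE
)

# Compact overview: number of rows and columns, and the type of each column.
str(friend.data)

dims        <- dim(friend.data)            # rows, columns
col_classes <- sapply(friend.data, class)  # class of every column
print(dims)
print(col_classes)
```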
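► Syntax:
merge(df1, df2, by.x, by.y, all.x, all.y, sort = TRUE)
► Parameters:
df1: one dataframe
df2: another dataframe
by.x, by.y: The names of the key columns in df1 and df2 that are used for
matching (use by when the key column has the same name in both).
all, all.x, all.y: Logical values that specify the type of merge:
all = FALSE gives an inner join, all.x = TRUE a left join, all.y = TRUE a
right join, and all = TRUE a full outer join.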
► First of all, we will create two dataframes that will help
us to understand each join easily.
► An inner join keeps only the rows that match in both data frames; for
this we specify the argument all = FALSE. In set-theory terms, this is
the intersection operation. For example:
► A = [1, 2, 3, 4, 5] B = [2, 3, 5, 6] Then the output of the inner join will be
(2, 3, 5)
► It is the simplest and most common type of join in R. Now
let us try to understand this using an R program:
# Joining of dataframes
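A minimal sketch of an inner join, using the key values from sets A and B above. The data frame names, the column names id, name, and score, and the non-key values are assumptions made for illustration.

```r
# Two data frames that share the key column "id".
df1 <- data.frame(id = c(1, 2, 3, 4, 5),
                  name = c("a", "b", "c", "d", "e"))
df2 <- data.frame(id = c(2, 3, 5, 6),
                  score = c(20, 30, 50, 60))

# all = FALSE (the default) keeps only the ids present in both data frames.
inner <- merge(df1, df2, by = "id", all = FALSE)
print(inner)
```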
► A left join is similar to an inner join, except that every row of the left
data frame is kept. Matching values from the right data frame are added, and
rows with no match are filled with NA. Now let
us try to understand this using an R program:
# R program to illustrate
# Joining of dataframes
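A minimal sketch of a left join with merge(). The data frames df1 and df2 and their columns are illustrative, not taken from the slides.

```r
df1 <- data.frame(id = c(1, 2, 3, 4, 5),
                  name = c("a", "b", "c", "d", "e"))
df2 <- data.frame(id = c(2, 3, 5, 6),
                  score = c(20, 30, 50, 60))

# all.x = TRUE keeps every row of the left data frame (df1);
# ids without a match in df2 get NA in the score column.
left <- merge(df1, df2, by = "id", all.x = TRUE)
print(left)
```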
► Taking multiple inputs in R is the same as taking a single input: just define
one readline() call per input. Braces can be used to group the readline() calls.
Note that readline() always returns a character string.
► var1 = readline("Enter 1st number : ");
var2 = readline("Enter 2nd number : ");
var3 = readline("Enter 3rd number : ");
var4 = readline("Enter 4th number : ");
or,
{
var1 = readline("Enter 1st number : ");
var2 = readline("Enter 2nd number : ");
var3 = readline("Enter 3rd number : ");
var4 = readline("Enter 4th number : ");
}
► character:
var1 = readline(prompt = "Enter any character : ");
var1 = as.character(var1)
# string input
var1 = readline(prompt = "Enter your name : ");
# character input
var2 = readline(prompt = "Enter any character : ");
# convert to character
var2 = as.character(var2)
# printing values
print(var1)
print(var2)
Using scan() method
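A minimal sketch of reading several values at once with scan(). At the console, scan() with no arguments reads from the keyboard until an empty line is entered; here the text argument stands in for the typed input so that the example runs without interaction. The input strings are illustrative.

```r
# Numeric input: scan() reads doubles by default.
nums <- scan(text = "10 20 30", quiet = TRUE)

# Character input: the what argument sets the type to read.
words <- scan(text = "red green blue", what = character(), quiet = TRUE)

print(nums)
print(words)
```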
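if statement
x <- 5
if(x > 0){ print("Positive number") }   # prints, because x is positive
y <- -1
if(y > 0){ print("Positive number") }   # prints nothing, because y is negative
# A condition of length greater than one is an error in R 4.2.0 and later;
# earlier versions used only the first element and gave a warning.
z <- c(x,y)
if(z > 0){ print("Positive number") }
m <- c(y,x)
if(m > 0){ print("Positive number") }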
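When the value being tested is a vector, the condition should be reduced to a single TRUE or FALSE first. A small sketch using any() and all(), with illustrative values:

```r
z <- c(5, -1)

any_positive <- any(z > 0)   # TRUE if at least one element is positive
all_positive <- all(z > 0)   # TRUE only if every element is positive

if (any_positive) { print("At least one positive number") }
if (all_positive) { print("All numbers are positive") }
```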
if…else statement
► The syntax of if…else statement is:
if (test_expression) {
statement1
} else {
statement2
}
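A minimal sketch of the syntax above; the value of x and the labels are chosen only for illustration.

```r
x <- -5

# The else keyword must follow the closing brace of the if block on the same line.
if (x > 0) {
  sign_label <- "Positive number"
} else {
  sign_label <- "Non-positive number"
}

print(sign_label)
```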
f <- c("K", "K", "none", "N", "P", "P", "N", "N", "none", "P", "K", "none")
fertilizer <- factor(f)
fertilizer
[1] K K none N P P N N none P K none
Levels: K N none P
First, it is clear that R is no longer considering the
elements of the factor as strings of characters, given the
absence of double-quotes. Second (and more
importantly), additional information in the form of
"Levels: K N none P" is given. The levels shown
correspond to the unique values seen in the
vector f (i.e., the categories that represent the
treatment groups).
There are other differences between a vector and a factor, which we can
see if we use the str(x) function. This function in R displays a compact
representation of the internal structure of any R variable. Let's see what
happens when we apply it to both f and fertilizer:
str(f)
# chr [1:12] "K" "K" "none" "N" "P" "P" "N" "N" "none" "P" "K" "none"
str(fertilizer)
# Factor w/ 4 levels "K","N","none",..: 1 1 3 2 4 4 2 2 3 4 ...
Note how in the factor fertilizer, the levels "K", "N", "none", and "P" are
replaced by numbers 1, 2, 3, and 4, respectively. So internally, R only stores
the numbers (indicating the level of each vector element) and
(separately) the names of each unique level. (Interestingly, even if the
vector's elements had been numerical, the levels are stored as strings of
text.)
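The way R internally stores factors is important when we want to
combine them. Consider the following attempt to combine
factors a.fac and b.fac. It fails in versions of R before 4.1.0, because
c() combines the underlying integer codes rather than the labels
(from R 4.1.0 onward, c() returns a factor with the levels X, Y, and Z):
a.fac = factor(c("X","Y","Z","X"))
b.fac = factor(c("X","X","Y","Y","Z"))
factor(c(a.fac,b.fac))
# [1] 1 2 3 1 1 1 2 2 3
# Levels: 1 2 3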
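One way to combine factors that behaves the same in every version of R is to convert each factor back to its character labels first and build a new factor from the combined labels. A minimal sketch; the name combined is illustrative.

```r
a.fac <- factor(c("X", "Y", "Z", "X"))
b.fac <- factor(c("X", "X", "Y", "Y", "Z"))

# Combine the labels, not the internal integer codes.
combined <- factor(c(as.character(a.fac), as.character(b.fac)))
print(combined)
```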
Factors can also be used to create tables in R, another important data type
in terms of its relationship to statistics.
As an example, suppose that a sample of 7 people are asked the following
questions in a study of workplace risk of tetanus infections:
● Q1: "Does your work put you in contact with soil, dust, or manure?" (Always,
Never, or Sometimes)
● Q2: "On any given day, is there a risk that you might be cut at work?" (Yes,
No, or Maybe)
The answers to each question for subjects 1 through 7 are given by the
following factors:
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
Thinking that there might be a relationship between these two
variables, we wish to construct a contingency table -- where the
levels of one variable form the column headers and the levels of
the other variable form the row headers, with the body of the table
indicating how many subjects were associated with each possible
pair of levels.
To create such a table in R, we simply use the table() command, as
shown below:
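A sketch of the call, using the two factors defined above; the name tab is an assumption. The levels of the first argument become the row headers and the levels of the second become the column headers.

```r
Q1 <- factor(c("Sometimes", "Sometimes", "Never", "Always",
               "Always", "Sometimes", "Sometimes"))
Q2 <- factor(c("Maybe", "Maybe", "Yes", "Maybe", "No", "Yes", "No"))

# Cross-tabulate the answers: one count per (Q1, Q2) pair of levels.
tab <- table(Q1, Q2)
print(tab)
```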
Accessing the elements of the table
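Elements of a table can be accessed much like those of a matrix, by level name or by position. A minimal sketch, rebuilding the table from the factors Q1 and Q2 so that it runs on its own; the result names are illustrative.

```r
Q1 <- factor(c("Sometimes", "Sometimes", "Never", "Always",
               "Always", "Sometimes", "Sometimes"))
Q2 <- factor(c("Maybe", "Maybe", "Yes", "Maybe", "No", "Yes", "No"))
tab <- table(Q1, Q2)

one_cell   <- tab["Sometimes", "Maybe"]  # a single count, selected by level names
always_row <- tab["Always", ]            # a whole row
yes_column <- tab[, "Yes"]               # a whole column

print(one_cell)
print(always_row)
print(yes_column)
```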
► One can also produce new tables from existing ones.
For example, suppose we wanted to see a table of
relative frequencies instead of counts. Much like one
might do with a vector, we simply divide the table by
the sum of its elements:
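A sketch of that division, rebuilding the table from Q1 and Q2 so that it runs on its own. The names tab and rel_freq are illustrative.

```r
Q1 <- factor(c("Sometimes", "Sometimes", "Never", "Always",
               "Always", "Sometimes", "Sometimes"))
Q2 <- factor(c("Maybe", "Maybe", "Yes", "Maybe", "No", "Yes", "No"))
tab <- table(Q1, Q2)

# Dividing by the total count turns every cell into a proportion.
rel_freq <- tab / sum(tab)
print(round(rel_freq, 3))
```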
Practice exercises
# declaring a vector
vec = c(1.1, NA, 2, 3.0, NA)
# inbuilt sum method
print("Sum of the vector:")
print(sum(vec))
# ignoring missing values
print("Sum of the vector without NA values:")
print(sum(vec, na.rm = TRUE))
# using inbuilt mean method
print("Mean of the vector with NA values:")
# not ignoring NA values
print(mean(vec))
# ignoring missing values
print("Mean of the vector without NA values:")
print(mean(vec, na.rm = TRUE))
# using inbuilt product method
print("Product of the vector:")
print(prod(vec))
# ignoring missing values
print("Product of the vector without NA values:")
print(prod(vec, na.rm = TRUE))
Practice
► Create a Vector X with these values: 1.1, 2, 3.0, 5.7 and repeat the
exercise.
► Create a Vector Y with these values: 7,NA,9,8,NA,75,NA,65 and
repeat the exercise.
► Make two vectors A and B, A being a sequence of range 1:10 and B
in the range of 10:15 sorted in decreasing order. Calculate
A+B,A-B,B-A,A*B,A/B,B/A, A^B,B^A, remainder and quotient when A/B
and B/A.
► Make two vectors C and D, C being a sequence of range 11:20 and
D in the range of 5:14 sorted in decreasing order. Repeat the previous
Matrix exercise