All R Notes

Uploaded by

Raina Shah

11/26/21, 12:39 PM R - Data Types

R - Data Types

When programming in any language, you use variables to store information. Variables are
reserved memory locations that store values, so creating a variable reserves some space in
memory.

You may want to store information of various data types such as character, wide character,
integer, floating point, double floating point, Boolean, etc. Based on the data type of a variable,
the operating system allocates memory and decides what can be stored in the reserved memory.

In contrast to languages such as C and Java, variables in R are not declared with a data type.
A variable is assigned an R object, and the data type of that object becomes the data type of
the variable. There are many types of R objects. The frequently used
ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector. There are six data types of atomic vectors, also
termed the six classes of vectors. The other R objects are built upon atomic
vectors.

https://www.tutorialspoint.com/r/r_data_types.htm 1/6

Data Type: Logical
Example: TRUE, FALSE
Verify:

v <- TRUE
print(class(v))

it produces the following result −

[1] "logical"

Data Type: Numeric
Example: 12.3, 5, 999
Verify:

v <- 23.5
print(class(v))

it produces the following result −

[1] "numeric"

Data Type: Integer
Example: 2L, 34L, 0L
Verify:

v <- 2L
print(class(v))

it produces the following result −

[1] "integer"

Data Type: Complex
Example: 3 + 2i
Verify:

v <- 2+5i
print(class(v))

it produces the following result −

[1] "complex"

Data Type: Character
Example: 'a', "good", "TRUE", '23.4'
Verify:

v <- "TRUE"
print(class(v))

it produces the following result −

[1] "character"

Data Type: Raw
Example: "Hello" is stored as 48 65 6c 6c 6f
Verify:

v <- charToRaw("Hello")
print(class(v))

it produces the following result −

[1] "raw"

In R programming, the most basic data types are the R objects called vectors, which hold
elements of the classes shown above. Note that in R the number of classes is not
confined to the above six types. For example, we can combine many atomic vectors to create
an array, whose class is array.
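As a minimal sketch of the point above, class() reports the kind of object while typeof() still reports the underlying atomic type. The names arr and mat are illustrative only and do not appear in the original notes.

```r
# Build an array and a matrix from plain atomic vectors (illustrative names).
arr <- array(1:24, dim = c(2, 3, 4))
mat <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2)

# class() describes the object; typeof() describes how its elements are stored.
print(class(arr))    # "array"
print(typeof(arr))   # "integer"
print(class(mat))    # "matrix" "array" in R >= 4.0.0; "matrix" alone in older versions
print(typeof(mat))   # "double"
```

So the class changes with the structure, but the elements remain one of the six atomic types.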

Vectors

When you want to create a vector with more than one element, you should use the c() function,
which combines the elements into a vector.

# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.
print(class(apple))

When we execute the above code, it produces the following result −

[1] "red" "green" "yellow"


[1] "character"

Lists

A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.
print(list1)

When we execute the above code, it produces the following result −

[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

Matrices

A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.

# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

When we execute the above code, it produces the following result −

[,1] [,2] [,3]


[1,] "a" "a" "b"
[2,] "c" "b" "a"

Arrays

While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which sets the number and size of the dimensions. In the
example below we create an array made of two 3x3 matrices.

# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)

When we execute the above code, it produces the following result −

, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
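The following sketch, assuming the same array as in the example above, shows how the two input values are recycled to fill all cells and how a single cell is addressed with [row, column, layer].

```r
# Same array as above: two values recycled into 3 x 3 x 2 = 18 cells.
a <- array(c('green', 'yellow'), dim = c(3, 3, 2))

print(dim(a))      # 3 3 2
print(length(a))   # 18

# Index with [row, column, layer].
print(a[1, 2, 1])  # "yellow"
print(a[1, 1, 2])  # "yellow"
```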

Factors

Factors are R objects which are created using a vector. A factor stores the vector along with the
distinct values of its elements as labels. The labels are always character,
irrespective of whether the input vector is numeric, character, Boolean, etc. Factors are
useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the count of levels.

# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.
factor_apple <- factor(apple_colors)

# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))

When we execute the above code, it produces the following result −

[1] green  green  yellow red    red    red    green
Levels: green red yellow
[1] 3
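To make the "labels plus values" description more concrete, here is a small sketch using the same vector as above. It relies only on the base functions levels(), as.integer() and table().

```r
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)

# Levels are the sorted distinct values, stored as character strings.
lv <- levels(factor_apple)
print(lv)             # "green" "red" "yellow"

# Each element is stored as an integer code pointing into the levels.
codes <- as.integer(factor_apple)
print(codes)          # 1 1 3 2 2 2 1

# table() counts how often each level occurs.
print(table(factor_apple))
```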

Data Frames

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain
a different mode of data. The first column can be numeric while the second column is
character and the third column is logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male","Female"),
   height = c(152, 171.5, 165),
   weight = c(81,93, 78),
   Age = c(42,38,26)
)
print(BMI)

When we execute the above code, it produces the following result −

  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
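As a rough illustration of "a list of vectors of equal length", the sketch below rebuilds the same BMI data frame and inspects it with base functions only.

```r
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)

# A data frame is a list of equal-length columns.
print(is.list(BMI))   # TRUE
print(dim(BMI))       # 3 4

# Column classes; gender is "character" in R >= 4.0.0 and "factor" in older versions.
print(sapply(BMI, class))
```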

11/26/21, 12:39 PM R - Vectors

R - Vectors

Vectors are the most basic R data objects and there are six types of atomic vectors. They are
logical, integer, double, complex, character and raw.

Vector Creation

Single Element Vector

Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the above vector types.

# Atomic vector of type character.
print("abc")

# Atomic vector of type double.
print(12.5)

# Atomic vector of type integer.
print(63L)

# Atomic vector of type logical.
print(TRUE)

# Atomic vector of type complex.
print(2+3i)

# Atomic vector of type raw.
print(charToRaw('hello'))

When we execute the above code, it produces the following result −

[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f

Multiple Elements Vector

Using colon operator with numeric data


https://www.tutorialspoint.com/r/r_vectors.htm 1/5

# Creating a sequence from 5 to 13.
v <- 5:13
print(v)

# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)

# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)

When we execute the above code, it produces the following result −

[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8

Using the seq() function

# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))

When we execute the above code, it produces the following result −

[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
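Besides by, seq() also accepts length.out, and the related seq_len() is a safer way to count from 1. The sketch below is illustrative only and its variable names are not from the original notes.

```r
# Five evenly spaced values from 0 to 1; seq() works out the step.
even <- seq(0, 1, length.out = 5)
print(even)                 # 0.00 0.25 0.50 0.75 1.00

# A descending sequence needs a negative step.
down <- seq(10, 2, by = -2)
print(down)                 # 10 8 6 4 2

# seq_len(n) gives 1..n and is empty when n is 0, unlike 1:0.
print(seq_len(3))           # 1 2 3
print(length(seq_len(0)))   # 0
print(1:0)                  # 1 0
```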

Using the c() function

The non-character values are coerced to character type if one of the elements is a character.

# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)

When we execute the above code, it produces the following result −

[1] "apple" "red" "5" "TRUE"
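The coercion shown above follows a fixed order in c(): logical, then integer, then double, then character. A minimal sketch, with x as an illustrative name:

```r
# logical + integer -> integer
print(class(c(TRUE, 2L)))   # "integer"
# integer + double -> numeric (double)
print(class(c(2L, 3.5)))    # "numeric"
# anything + character -> character
print(class(c(3.5, "a")))   # "character"

# Converting back can fail: non-numeric strings become NA (with a warning).
x <- suppressWarnings(as.numeric(c("5", "apple")))
print(x)                    # 5 NA
```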

Accessing Vector Elements

Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing.
Indexing starts with position 1. Giving a negative value in the index drops that element from
the result. TRUE and FALSE can also be used for indexing. Numeric 0 and 1 are positions, not a
mask: a 0 selects nothing and a 1 selects the first element.

# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)

# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)

# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)

# Accessing vector elements using 0/1 indexing (0 selects nothing, 1 selects the first element).
y <- t[c(0,0,0,0,0,0,1)]
print(y)

When we execute the above code, it produces the following result −

[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
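The last result is "Sun" rather than "Sat" because numeric 0/1 values are treated as positions. The sketch below contrasts that with a logical mask; days, mask, first and last are illustrative names.

```r
days <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")

# Numeric 0/1 index: zeros select nothing, 1 selects the first element.
first <- days[c(0, 0, 1)]
print(first)                     # "Sun"

# A logical mask built from the same kind of 0/1 values selects by position instead.
mask <- as.logical(c(0, 0, 0, 0, 0, 0, 1))
last <- days[mask]
print(last)                      # "Sat"

# which() converts a logical condition into positions.
print(which(days == "Wed"))      # 4
```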

Vector Manipulation

Vector arithmetic

Two vectors of same length can be added, subtracted, multiplied or divided giving the result as
a vector output.

# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)

# Vector addition.
add.result <- v1+v2
print(add.result)

# Vector subtraction.
sub.result <- v1-v2
print(sub.result)

# Vector multiplication.
multi.result <- v1*v2
print(multi.result)

# Vector division.
divi.result <- v1/v2
print(divi.result)

When we execute the above code, it produces the following result −

[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000

Vector Element Recycling

If we apply arithmetic operations to two vectors of unequal length, then the elements of the
shorter vector are recycled to complete the operations.

v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# v2 is recycled to c(4,11,4,11,4,11)

add.result <- v1+v2
print(add.result)

sub.result <- v1-v2
print(sub.result)

When we execute the above code, it produces the following result −

[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
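Recycling also happens when the longer length is not a multiple of the shorter one, but R then raises a warning. The following sketch, with illustrative vectors, captures that warning instead of printing it.

```r
v1 <- c(3, 8, 4, 5, 0)   # length 5
v2 <- c(4, 11)           # length 2, which does not divide 5

# Capture the warning text instead of letting it print.
msg <- tryCatch(
   { v1 + v2; "no warning" },
   warning = function(w) conditionMessage(w)
)
print(msg)

# The result is still computed, with v2 recycled as 4 11 4 11 4.
res <- suppressWarnings(v1 + v2)
print(res)               # 7 19 8 16 4
```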

Vector Element Sorting

Elements in a vector can be sorted using the sort() function.

v <- c(3,8,4,5,0,11, -9, 304)

# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)

# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)

# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)

# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)

When we execute the above code, it produces the following result −

[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
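When a second vector has to stay aligned with the sorted one, order() is often used instead of sort(). This is a sketch only; the tags vector is a hypothetical companion to v.

```r
v <- c(3, 8, 4, 5, 0, 11, -9, 304)
tags <- c("a", "b", "c", "d", "e", "f", "g", "h")   # hypothetical companion vector

# order() returns the positions that would sort v.
idx <- order(v)
print(idx)         # 7 5 1 3 4 2 6 8

# Indexing with idx sorts v and keeps tags aligned with it.
print(v[idx])      # -9 0 3 4 5 8 11 304
print(tags[idx])   # "g" "e" "a" "c" "d" "b" "f" "h"
```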

11/26/21, 12:40 PM R - Lists

R - Lists

Lists are R objects which can contain elements of different types, such as numbers, strings,
vectors and even another list. A list can also contain a matrix or a function as an element. A list
is created using the list() function.

Creating a List
Following is an example that creates a list containing strings, numbers, a vector and a logical
value.

# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)

When we execute the above code, it produces the following result −

[[1]]
[1] "Red"

[[2]]
[1] "Green"

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE

[[5]]
[1] 51.23

[[6]]
[1] 119.1

Naming List Elements


The list elements can be given names and they can be accessed using these names.


https://www.tutorialspoint.com/r/r_lists.htm 1/5

# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
   list("green",12.3))

# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Show the list.
print(list_data)

When we execute the above code, it produces the following result −

$`1st Quarter`
[1] "Jan" "Feb" "Mar"

$A_Matrix
     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8

$`A Inner list`
$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]
[1] 12.3

Accessing List Elements

Elements of the list can be accessed by the index of the element in the list. In the case of named
lists, they can also be accessed using the names.
We continue to use the list in the above example −

# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
   list("green",12.3))

# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Access the first element of the list.
print(list_data[1])

# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])

# Access the list element using the name of the element.
print(list_data$A_Matrix)

When we execute the above code, it produces the following result −

$`1st Quarter`
[1] "Jan" "Feb" "Mar"

$`A Inner list`
$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]
[1] 12.3

     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8
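Single brackets, as used above, return a sub-list, which is why the names are printed along with the values. The sketch below, using the same list, shows how double brackets and $ return the element itself.

```r
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
   list("green",12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Single brackets keep the list wrapper.
print(class(list_data[1]))             # "list"

# Double brackets return the element itself.
print(class(list_data[[1]]))           # "character"
print(list_data[["1st Quarter"]][2])   # "Feb"

# Names containing spaces need backticks when used with $.
print(list_data$`A Inner list`[[2]])   # 12.3
```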

Manipulating List Elements

We can add, delete and update list elements as shown below. The example adds and deletes an
element at the end of the list, but any element can be deleted or updated by its index or name.

# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
   list("green",12.3))

# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])

# Remove the last element.
list_data[4] <- NULL

# Print the 4th Element.
print(list_data[4])

# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])


When we execute the above code, it produces the following result −

[[1]]
[1] "New element"

$<NA>
NULL

$`A Inner list`
[1] "updated element"
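Elements away from the end of a list can be removed or inserted as well. A minimal sketch, using a small hypothetical list rather than the one above:

```r
small_list <- list(a = 1, b = "two", c = c(3, 3, 3))   # hypothetical example list

# Delete the middle element; later elements shift up.
small_list["b"] <- NULL
after_delete <- names(small_list)
print(after_delete)    # "a" "c"

# Insert a new element after position 1 with append().
small_list <- append(small_list, list(new = TRUE), after = 1)
after_insert <- names(small_list)
print(after_insert)    # "a" "new" "c"
```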

Merging Lists

You can merge many lists into one list by placing all the lists inside one c() function.

# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.
merged.list <- c(list1,list2)

# Print the merged list.
print(merged.list)

When we execute the above code, it produces the following result −

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] "Sun"

[[5]]
[1] "Mon"

[[6]]
[1] "Tue"
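The choice of c() matters here. As a short sketch with the same two lists, wrapping them in list() instead produces a nested list of two elements rather than one flat list of six.

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

merged <- c(list1, list2)      # one flat list
nested <- list(list1, list2)   # a list holding two lists

print(length(merged))   # 6
print(length(nested))   # 2
```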


Converting List to Vector

A list can be converted to a vector so that the elements of the vector can be used for further
manipulation. All the arithmetic operations on vectors can be applied after the list is converted
into vectors. To do this conversion, we use the unlist() function. It takes the list as input and
produces a vector.

# Create lists.
list1 <- list(1:5)
print(list1)

list2 <- list(10:14)
print(list2)

# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)

print(v1)
print(v2)

# Now add the vectors
result <- v1+v2
print(result)

When we execute the above code, it produces the following result −

[[1]]
[1] 1 2 3 4 5

[[1]]
[1] 10 11 12 13 14

[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19

11/26/21, 12:42 PM R - Data Frames

R - Data Frames

A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.

Following are the characteristics of a data frame.

The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.

Create Data Frame

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

When we execute the above code, it produces the following result −

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

Get the Structure of the Data Frame

The structure of the data frame can be seen by using str() function.


https://www.tutorialspoint.com/r/r_data_frames.htm 1/6

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

When we execute the above code, it produces the following result −

'data.frame':   5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11"

Summary of Data in Data Frame

The statistical summary and nature of the data can be obtained by applying summary()
function.

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))

When we execute the above code, it produces the following result −

     emp_id    emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27

Extract Data from Data Frame

Extract specific column from a data frame using column name.

# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)

When we execute the above code, it produces the following result −

emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25

Extract the first two rows and then all columns

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)

When we execute the above code, it produces the following result −

  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23

Extract 3rd and 5th row with 2nd and 4th column

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)

# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)

When we execute the above code, it produces the following result −

emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
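Rows can also be selected by a condition rather than by position. The sketch below assumes a cut-down version of emp.data (without the date column) and an arbitrary salary threshold of 700 chosen only for illustration.

```r
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   stringsAsFactors = FALSE
)

# Keep rows where the condition is TRUE; the empty column slot keeps all columns.
high_paid <- emp.data[emp.data$salary > 700, ]
print(high_paid)

# subset() expresses the same filter and can pick columns by name.
names_only <- subset(emp.data, salary > 700, select = emp_name)
print(names_only)
```

Both forms keep the original row names (4 and 5 here), which makes it easy to trace rows back to the source data frame.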

Expand Data Frame

A data frame can be expanded by adding columns and rows.

Add Column

Just add the column vector using a new column name.

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)

# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)

When we execute the above code, it produces the following result −

  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance

Add Row

To add more rows permanently to an existing data frame, we need to bring in the new rows in
the same structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing data
frame to create the final data frame.

# Create the first data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   dept = c("IT","Operations","IT","HR","Finance"),
   stringsAsFactors = FALSE
)

# Create the second data frame
emp.newdata <- data.frame(
   emp_id = c (6:8),
   emp_name = c("Rasmi","Pranab","Tusar"),
   salary = c(578.0,722.5,632.8),
   start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
   dept = c("IT","Operations","Finance"),
   stringsAsFactors = FALSE
)

# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

When we execute the above code, it produces the following result −

  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance
6      6    Rasmi 578.00 2013-05-21         IT
7      7   Pranab 722.50 2013-07-30 Operations
8      8    Tusar 632.80 2014-06-17    Finance

11/26/21, 12:36 PM R - Matrices

R - Matrices

Matrices are R objects in which the elements are arranged in a two-dimensional rectangular
layout. They contain elements of the same atomic type. Though we can create a matrix
containing only characters or only logical values, they are not of much use. We mostly use
matrices containing numeric elements in mathematical calculations.

A Matrix is created using the matrix() function.

Syntax

The basic syntax for creating a matrix in R is −

matrix(data, nrow, ncol, byrow, dimnames)

Following is the description of the parameters used −


data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical flag. If TRUE then the input vector elements are arranged by row.
dimnames is the names assigned to the rows and columns.

Example

Create a matrix taking a vector of numbers as input.

# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)

# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)

# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)

https://www.tutorialspoint.com/r/r_matrices.htm 1/4

When we execute the above code, it produces the following result −

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14

Accessing Elements of a Matrix


Elements of a matrix can be accessed by using the column and row index of the element. We
consider the matrix P above to find the specific elements below.

# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))

# Access the element at 3rd column and 1st row.
print(P[1,3])

# Access the element at 2nd column and 4th row.
print(P[4,2])

# Access only the 2nd row.
print(P[2,])

# Access only the 3rd column.
print(P[,3])

When we execute the above code, it produces the following result −

[1] 5
[1] 13
col1 col2 col3
   6    7    8
row1 row2 row3 row4
   5    8   11   14
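As the output shows, extracting a single row or column returns a named vector, not a matrix. The sketch below rebuilds P and uses the drop argument of the [ operator to keep the matrix shape; the name row2 for the result is illustrative.

```r
P <- matrix(c(3:14), nrow = 4, byrow = TRUE,
   dimnames = list(c("row1","row2","row3","row4"), c("col1","col2","col3")))

# Extracting one row returns a plain vector by default.
print(is.matrix(P[2, ]))       # FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix.
row2 <- P[2, , drop = FALSE]
print(dim(row2))               # 1 3

# Rows and columns can also be selected by name.
print(P["row4", "col2"])       # 13
```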

Matrix Computations

Various mathematical operations are performed on matrices using the R operators. The
result of the operation is also a matrix.
The dimensions (number of rows and columns) should be the same for the matrices involved in the
operation.

Matrix Addition & Subtraction

# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)

# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)

# Subtract the matrices
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)

When we execute the above code, it produces the following result −

     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of addition
     [,1] [,2] [,3]
[1,]    8   -1    5
[2,]   11   13   10
Result of subtraction
     [,1] [,2] [,3]
[1,]   -2   -1   -1
[2,]    7   -5    2

Matrix Multiplication & Division

# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)

# Multiply the matrices element by element.
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)

# Divide the matrices element by element.
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)

When we execute the above code, it produces the following result −

     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of multiplication
     [,1] [,2] [,3]
[1,]   15    0    6
[2,]   18   36   24
Result of division
     [,1]      [,2]      [,3]
[1,]  0.6      -Inf 0.6666667
[2,]  4.5 0.4444444 1.5000000
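Note that * multiplies matching cells. The linear-algebra product uses the %*% operator instead and needs compatible dimensions, so this sketch transposes the second matrix with t(). The name mat_prod is illustrative.

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)   # 2 x 3
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)    # 2 x 3

# Element-wise product, as in the example above.
print(matrix1 * matrix2)

# Linear-algebra product: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix.
mat_prod <- matrix1 %*% t(matrix2)
print(mat_prod)
```

Here mat_prod has rows (21, 5) and (63, 78), for example 3*5 + (-1)*0 + 2*3 = 21 in the top-left cell.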

11/26/21, 12:41 PM R - Factors

R - Factors

Factors are data objects which are used to categorize data and store it as levels. They
can store both strings and integers. They are useful in columns which have a limited number
of unique values, like "Male", "Female" and True, False, etc. They are useful in data analysis for
statistical modeling.

Factors are created using the factor() function by taking a vector as input.

Example

# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")

print(data)
print(is.factor(data))

# Apply the factor function.
factor_data <- factor(data)

print(factor_data)
print(is.factor(factor_data))

When we execute the above code, it produces the following result −

[1] "East"  "West"  "East"  "North" "North" "East"  "West"  "West"  "West"  "East"  "North"
[1] FALSE
[1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
[1] TRUE

Factors in Data Frame


On creating a data frame with a column of text data, R versions before 4.0.0 treat the text
column as categorical data and create factors on it by default. From R 4.0.0 onward this
happens only when stringsAsFactors = TRUE is passed, as in the example below.

# Create the vectors for data frame.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")

# Create the data frame. stringsAsFactors = TRUE is needed from R 4.0.0 onward.
input_data <- data.frame(height,weight,gender, stringsAsFactors = TRUE)
print(input_data)

# Test if the gender column is a factor.
print(is.factor(input_data$gender))

# Print the gender column to see the levels.
print(input_data$gender)

When we execute the above code, it produces the following result −

  height weight gender
1    132     48   male
2    151     49   male
3    162     66 female
4    139     53 female
5    166     67   male
6    147     52 female
7    122     40   male
[1] TRUE
[1] male   male   female female male   female male
Levels: female male

Changing the Order of Levels

The order of the levels in a factor can be changed by applying the factor function again with the
new order of the levels.

data <- c("East","West","East","North","North","East","West",
   "West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)

# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)

When we execute the above code, it produces the following result −

[1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
[1] East  West  East  North North East  West  West  West  East  North
Levels: East West North

Generating Factor Levels

We can generate factor levels by using the gl() function. It takes two integers as input, which
indicate how many levels there are and how many times each level is repeated.

Syntax

gl(n, k, labels)

Following is the description of the parameters used −

n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.

Example

v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)

When we execute the above code, it produces the following result −

 [1] Tampa   Tampa   Tampa   Tampa   Seattle Seattle Seattle Seattle Boston
[10] Boston  Boston  Boston
Levels: Tampa Seattle Boston

https://www.tutorialspoint.com/r/r_factors.htm 3/3
11/26/21, 12:48 PM R - Data Reshaping

R - Data Reshaping

Data reshaping in R is about changing the way data is organized into rows and columns. Most
data processing in R takes a data frame as input. It is easy to extract data from the rows and
columns of a data frame, but there are situations when we need the data frame in a format that
is different from the format in which we received it. R has many functions to split, merge and
change the rows to columns and vice-versa in a data frame.

Joining Columns and Rows in a Data Frame


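We can join multiple vectors column-wise using the cbind() function. Note that cbind() applied
to plain vectors returns a matrix (here a character matrix) rather than a data frame. We can
also append the rows of one data set to another using the rbind() function.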

Live Demo
# Create vector objects.
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
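# Note: the literal 06161 is parsed as the number 6161, so the leading zero is lost.
zipcode <- c(33602,98104,06161,80294)

# Combine the three vectors column-wise. cbind() on plain vectors returns a
# character matrix rather than a data frame.
addresses <- cbind(city,state,zipcode)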

# Print a header.
cat("# # # # The First data frame\n")

# Print the data frame.


print(addresses)

# Create another data frame with similar columns


new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n")

# Print the data frame.


print(new.address)

# Combine rows from both the data frames.

all.addresses <- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n")

# Print the result.


print(all.addresses)

When we execute the above code, it produces the following result −

# # # # The First data frame


city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"

# # # The Second data frame


city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949

# # # The combined data frame


city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
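The output above shows two side effects of the cbind() approach: the first object is a character matrix, and the Hartford zip code loses its leading zero. A minimal alternative sketch, assuming we are free to store zip codes as strings and to build the first object with data.frame() directly:

```r
city    <- c("Tampa", "Seattle", "Hartford", "Denver")
state   <- c("FL", "WA", "CT", "CO")
zipcode <- c("33602", "98104", "06161", "80294")   # strings keep the leading zero

# data.frame() builds a real data frame with one column per vector.
addresses <- data.frame(city, state, zipcode, stringsAsFactors = FALSE)

new.address <- data.frame(
   city = c("Lowry", "Charlotte"),
   state = c("CO", "FL"),
   zipcode = c("80230", "33949"),
   stringsAsFactors = FALSE
)

# rbind() appends the rows; the column names must match.
all.addresses <- rbind(addresses, new.address)

print(class(addresses))
print(all.addresses)
```

Keeping identifiers such as zip codes as character data is a design choice: they are labels, not quantities, so numeric parsing only risks losing information.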

Merging Data Frames


We can merge two data frames by using the merge() function. The merge happens on key columns,
which in this example have the same names in both data frames.
In the example below, we consider the data sets about diabetes in Pima Indian women, available
in the library named "MASS". We merge the two data sets based on the values of blood pressure
("bp") and body mass index ("bmi"). Records whose values of these two variables match in both
data sets are combined to form a single data frame.

Live Demo
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")


)
print(merged.Pima)
nrow(merged.Pima)

When we execute the above code, it produces the following result −

bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
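To see the matching behaviour on something smaller than the Pima data, here is a minimal sketch with two hypothetical data frames (the names left and right and their values are invented for illustration). By default merge() keeps only rows whose key appears in both inputs; all = TRUE keeps every row and fills the gaps with NA.

```r
left  <- data.frame(id = c(1, 2, 3), bp = c(60, 64, 70))
right <- data.frame(id = c(2, 3, 4), bmi = c(29.7, 27.4, 33.1))

# Default (inner) merge: only ids present in both data frames.
inner <- merge(left, right, by = "id")

# all = TRUE keeps unmatched rows from both sides, with NA for missing values.
outer <- merge(left, right, by = "id", all = TRUE)

print(inner)
print(outer)
```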

Melting and Casting


One of the most interesting aspects of R programming is changing the shape of the data in
multiple steps to get a desired shape. The functions used to do this are melt() and cast(),
which are provided by the reshape package rather than by base R.

We consider the dataset called ships present in the library called "MASS".

Live Demo
library(MASS)
print(ships)

When we execute the above code, it produces the following result −

type year period service incidents


1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
4 A 65 75 1095 4
5 A 70 60 1512 6
.............
.............
8 A 75 75 2244 11
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
............
............
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
............
............

Melt the Data

Now we melt the data to organize it, converting all columns other than type and year into
multiple rows.

# melt() and cast() come from the reshape package, which must be installed.
library(reshape)

molten.ships <- melt(ships, id = c("type","year"))


print(molten.ships)

When we execute the above code, it produces the following result −

type year variable value


1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75


............
............
9 B 60 period 60
10 B 60 period 75
11 B 65 period 60
12 B 65 period 75
13 B 70 period 60
...........
...........
41 A 60 service 127
42 A 60 service 63
43 A 65 service 1095
...........
...........
70 D 70 service 1208
71 D 75 service 0
72 D 75 service 2051
73 E 60 service 45
74 E 60 service 0
75 E 65 service 789
...........
...........
101 C 70 incidents 6
102 C 70 incidents 2
103 C 75 incidents 0
104 C 75 incidents 1
105 D 60 incidents 0
106 D 60 incidents 0
...........
...........

Cast the Molten Data

We can cast the molten data into a new form in which the values of each variable are summed for
each combination of ship type and year. It is done using the cast() function.

recasted.ship <- cast(molten.ships, type+year~variable,sum)


print(recasted.ship)

When we execute the above code, it produces the following result −

type year period service incidents


1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68


6 B 65 135 48979 111


7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
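As a cross-check that needs no extra package, the same per-group sums can be sketched with aggregate() from base R. This is an alternative technique, not the melt/cast method shown above, and it assumes the MASS package (for the ships data) is installed.

```r
library(MASS)

# Sum period, service and incidents for each combination of type and year.
totals <- aggregate(cbind(period, service, incidents) ~ type + year,
                    data = ships, FUN = sum)

# Sort by type, then year, to match the layout of the cast() output.
totals <- totals[order(totals$type, totals$year), ]

print(head(totals))
```

Each group contains one row for period 60 and one for period 75, which is why the summed period column is 135 throughout.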

11/26/21, 12:53 PM R - Linear Regression

R - Linear Regression

Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose value is
derived from the predictor variable.

In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not
equal to 1, creates a curve.

The general mathematical equation for a linear regression is −

y = ax + b

Following is the description of the parameters used −


y is the response variable.

x is the predictor variable.

a and b are constants which are called the coefficients (a is the slope and b is the intercept).

Steps to Establish a Regression


A simple example of regression is predicting the weight of a person when their height is known. To do
this we need to have the relationship between the height and weight of a person.
The steps to create the relationship are −

Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.

Create a relationship model using the lm() function in R.

Find the coefficients from the model created and create the mathematical equation
using these.

Get a summary of the relationship model to know the average error in prediction, which is
based on the residuals.

To predict the weight of new persons, use the predict() function in R.

Input Data

Below is the sample data representing the observations −


# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax

The basic syntax for lm() function in linear regression is −

lm(formula,data)

Following is the description of the parameters used −

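formula is a symbolic description of the relation between x and y, such as y ~ x.

data is the data frame containing the variables named in the formula. It can be omitted when
the variables exist in the workspace, as in the examples here.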

Create Relationship Model & get the Coefficients

Live Demo
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746
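For a single predictor, the coefficients reported by lm() can be reproduced with the standard least-squares formulas: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). A minimal sketch (the names slope, intercept and fit are introduced here for illustration):

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight

# Closed-form least-squares estimates for simple linear regression.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates obtained from lm().
fit <- lm(y ~ x)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

In terms of the equation y = ax + b given earlier, the slope corresponds to a and the intercept to b.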

Get the Summary of the Relationship

Live Demo
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

predict() Function

Syntax

The basic syntax for predict() in linear regression is −

predict(object, newdata)

Following is the description of the parameters used −

object is the model which is already created using the lm() function.

newdata is a data frame containing the new values for the predictor variable, in a column
named after the predictor.

Predict the weight of new persons

Live Demo
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.


a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

When we execute the above code, it produces the following result −

1
76.22869
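predict() simply evaluates the fitted equation at the new predictor values. The sketch below, with illustrative names new_heights, pred and manual, predicts several heights at once and compares the result with a hand calculation from the coefficients. The column in the new data frame has to be called x because that is the predictor name used in the formula.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Several new heights; the column name must match the predictor name.
new_heights <- data.frame(x = c(160, 170, 180))
pred <- predict(relation, new_heights)

# The same values computed directly as intercept + slope * height.
manual <- coef(relation)[1] + coef(relation)[2] * new_heights$x

print(pred)
print(unname(manual))
```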

Visualize the Regression Graphically

Live Demo
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.


png(file = "linearregression.png")

# Plot the chart.


plot(y,x,col = "blue",main = "Height & Weight Regression",
cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Add the fitted line for these axes (height regressed on weight).
abline(lm(x~y))

# Save the file.


dev.off()

When we execute the above code, it produces the following result −

11/26/21, 12:53 PM R - Multiple Regression

R - Multiple Regression

Multiple regression is an extension of linear regression to relationships between more than two
variables. In simple linear regression we have one predictor and one response variable, but in
multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is −

y = a + b1*x1 + b2*x2 + ... + bn*xn

Following is the description of the parameters used −

y is the response variable.

a, b1, b2...bn are the coefficients.


x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model determines the value of
the coefficients using the input data. Next we can predict the value of the response variable for
a given set of predictor variables using these coefficients.

lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax

The basic syntax for lm() function in multiple regression is −

lm(y ~ x1+x2+x3..., data)

Following is the description of the parameters used −

formula is a symbolic description of the relation between the response variable and the
predictor variables.

data is the data frame containing the variables named in the formula.

Example

Input Data

Consider the data set "mtcars" available in the R environment. It gives a comparison between
different car models in terms of miles per gallon ("mpg"), cylinder displacement ("disp"), horse
power ("hp"), weight of the car ("wt") and some more parameters.


The goal of the model is to establish the relationship between "mpg" as a response variable with
"disp","hp" and "wt" as predictor variables. We create a subset of these variables from the
mtcars data set for this purpose.

Live Demo
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))

When we execute the above code, it produces the following result −

mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460

Create Relationship Model & get the Coefficients

Live Demo
input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.


model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.


print(model)

# Get the Intercept and coefficients as vector elements.


cat("# # # # The Coefficient Values # # # ","\n")

a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]


Xhp <- coef(model)[3]
Xwt <- coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)

When we execute the above code, it produces the following result −


Call:
lm(formula = mpg ~ disp + hp + wt, data = input)

Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

# # # # The Coefficient Values # # #


(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891

Create Equation for Regression Model

Based on the above intercept and coefficient values, we create the mathematical equation.

Y = a+Xdisp*x1+Xhp*x2+Xwt*x3
or
Y = 37.105505+(-0.000937)*x1+(-0.031157)*x2+(-3.800891)*x3

Apply Equation for predicting New Values

We can use the regression equation created above to predict the mileage when a new set of
values for displacement, horse power and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is −

Y = 37.105505+(-0.000937)*221+(-0.031157)*102+(-3.800891)*2.91 = 22.6598
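Rather than typing the coefficients into an equation by hand, the fitted model can be applied to new values with predict(). The sketch below uses illustrative names (new_car, pred, manual) and also shows the equivalent hand calculation. Because predict() uses the unrounded coefficients, its result can differ slightly from arithmetic done with rounded values.

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# New values for the three predictors; column names must match the formula.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
pred <- predict(model, new_car)

# Equivalent hand calculation: intercept + sum of coefficient * value.
manual <- sum(coef(model) * c(1, 221, 102, 2.91))

print(pred)
print(manual)
```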

11/26/21, 12:59 PM R - Chi Square Test

R - Chi Square Test

Chi-Square test is a statistical method to determine if two categorical variables have a
significant association between them. Both variables should be from the same population and
they should be categorical, like Yes/No, Male/Female, Red/Green etc.

For example, we can build a data set with observations on people's ice-cream buying pattern
and try to relate the gender of a person to the flavor of ice-cream they prefer. If an
association is found, we can plan for an appropriate stock of flavors by knowing the gender
mix of the people visiting.

Syntax
The function used for performing chi-Square test is chisq.test().

The basic syntax for creating a chi-square test in R is −

chisq.test(data)

Following is the description of the parameters used −

data is the data in form of a table containing the count value of the variables in the
observation.

Example
We will take the Cars93 data in the "MASS" library which represents the sales of different
models of car in the year 1993.

Live Demo
library("MASS")
# str() prints the structure itself, so no print() call is needed.
str(Cars93)

When we execute the above code, it produces the following result −

'data.frame': 93 obs. of 27 variables:


$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 .
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ..
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8

The above result shows the dataset has many Factor variables which can be considered as
categorical variables. For our model we will consider the variables "AirBags" and "Type". Here
we aim to find out whether there is a significant association between the type of car sold and
the type of air bags it has. If an association is observed, we can estimate which types of cars
sell better with which types of air bags.

Live Demo
# Load the library.
library("MASS")

# Create a data frame from the main data set.


car.data <- data.frame(Cars93$AirBags, Cars93$Type)

# Create a table with the needed variables.


car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.


print(chisq.test(car.data))

When we execute the above code, it produces the following result −

Compact Large Midsize Small Sporty Van


Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6


Pearson's Chi-squared test

data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect

Conclusion
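The result shows a p-value of less than 0.05, which indicates a statistically significant
association between the two variables. The warning appears because some expected cell counts
are small, so the chi-squared approximation may be unreliable.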
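The warning in the output can be investigated by looking at the expected counts that chisq.test() computes. The sketch below, which assumes the MASS package is installed and uses illustrative names tbl and res, prints those counts and then repeats the test with a simulated p-value, an option of chisq.test() that does not rely on the chi-squared approximation.

```r
library(MASS)

tbl <- table(Cars93$AirBags, Cars93$Type)

# Run the test once, silencing the warning, and inspect the expected counts.
res <- suppressWarnings(chisq.test(tbl))
print(round(res$expected, 1))

# Number of cells whose expected count is below 5.
print(sum(res$expected < 5))

# Monte Carlo p-value; set.seed() makes the simulation repeatable.
set.seed(1)
print(chisq.test(tbl, simulate.p.value = TRUE, B = 2000))
```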

11/26/21, 12:54 PM R - Binomial Distribution

R - Binomial Distribution

The binomial distribution model deals with finding the probability of success of an event which
has only two possible outcomes in a series of experiments. For example, tossing a coin
always gives a head or a tail. The probability of finding exactly 3 heads when tossing a coin
10 times is calculated using the binomial distribution.

R has four built-in functions for the binomial distribution. They are described below.

dbinom(x, size, prob)


pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

Following is the description of the parameters used −


x is a vector of numbers.

p is a vector of probabilities.

n is number of observations.

size is the number of trials.


prob is the probability of success of each trial.

dbinom()
This function gives the probability mass, that is the probability of exactly x successes, at each point.

Live Demo
# Create the sequence 0, 1, ..., 50 (51 values), one for each possible number of successes.
x <- seq(0,50,by = 1)

# Create the binomial distribution.


y <- dbinom(x,50,0.5)

# Give the chart file a name.


png(file = "dbinom.png")

# Plot the graph for this sample.


plot(x,y)

# Save the file.


dev.off()


When we execute the above code, it produces the following result −

pbinom()

This function gives the cumulative probability of an event. It is a single value representing the
probability.

Live Demo
# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)

print(x)

When we execute the above code, it produces the following result −

[1] 0.610116
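The cumulative probability is the sum of the individual probabilities given by dbinom(). A minimal sketch of that relationship, with illustrative variable names, which also shows the lower.tail argument for the complementary probability:

```r
# P(X <= 26) for 51 tosses of a fair coin.
p_cum <- pbinom(26, size = 51, prob = 0.5)

# The same value as a sum of the probabilities of 0, 1, ..., 26 heads.
p_sum <- sum(dbinom(0:26, size = 51, prob = 0.5))

# P(X > 26), the upper tail.
p_more <- pbinom(26, size = 51, prob = 0.5, lower.tail = FALSE)

print(c(p_cum = p_cum, p_sum = p_sum, p_more = p_more))
```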

qbinom()

This function takes the probability value and gives a number whose cumulative value matches
the probability value.

Live Demo

# Find the smallest number of heads whose cumulative probability reaches 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)

print(x)

When we execute the above code, it produces the following result −

[1] 23
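Because the binomial distribution is discrete, the cumulative probability rarely equals the requested value exactly. qbinom() returns the smallest count whose cumulative probability is at least p, which the following sketch checks against pbinom():

```r
q <- qbinom(0.25, size = 51, prob = 0.5)
print(q)

# One step below q the cumulative probability is still under 0.25 ...
print(pbinom(q - 1, size = 51, prob = 0.5))

# ... and at q it has reached or passed 0.25.
print(pbinom(q, size = 51, prob = 0.5))
```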

rbinom()

This function generates the required number of random values, each being the number of
successes in a given number of trials with a given probability of success.

Live Demo
# Generate 8 random values, each the number of successes in 150 trials with probability 0.4.
x <- rbinom(8,150,.4)

print(x)

When we execute the above code, it produces a result like the following (the values are random and vary between runs) −

[1] 58 61 59 66 55 60 61 67
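Since rbinom() draws random numbers, repeated runs give different values. If a repeatable result is needed, the random number generator can be seeded with set.seed() first. A small sketch (the seed value 42 is arbitrary):

```r
# Seeding before each call reproduces the same draws.
set.seed(42)
a <- rbinom(8, size = 150, prob = 0.4)

set.seed(42)
b <- rbinom(8, size = 150, prob = 0.4)

print(a)
print(identical(a, b))

# With many draws the average approaches size * prob = 150 * 0.4 = 60.
print(mean(rbinom(10000, size = 150, prob = 0.4)))
```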

11/26/21, 12:54 PM R - Normal Distribution

R - Normal Distribution

In a random collection of data from independent sources, it is generally observed that the
distribution of data is normal. This means that, on plotting a graph with the value of the variable on
the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve. The
center of the curve represents the mean of the data set. In the graph, fifty percent of values lie
to the left of the mean and the other fifty percent lie to the right of the mean. This is referred to as
the normal distribution in statistics.

R has four in built functions to generate normal distribution. They are described below.

dnorm(x, mean, sd)


pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

Following is the description of the parameters used in above functions −

x is a vector of numbers.

p is a vector of probabilities.
n is number of observations(sample size).

mean is the mean value of the sample data. Its default value is zero.

sd is the standard deviation. Its default value is 1.

dnorm()
This function gives height of the probability distribution at each point for a given mean and
standard deviation.

Live Demo
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)

# Choose the mean as 2.5 and standard deviation as 0.5.


y <- dnorm(x, mean = 2.5, sd = 0.5)

# Give the chart file a name.


png(file = "dnorm.png")

plot(x,y)


# Save the file.


dev.off()

When we execute the above code, it produces the following result −

pnorm()
This function gives the probability that a normally distributed random number is less than or
equal to a given value. It is also called the "Cumulative Distribution Function".

Live Demo
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)

# Choose the mean as 2.5 and standard deviation as 2.


y <- pnorm(x, mean = 2.5, sd = 2)

# Give the chart file a name.


png(file = "pnorm.png")

# Plot the graph.


plot(x,y)


# Save the file.


dev.off()

When we execute the above code, it produces the following result −

qnorm()
This function takes the probability value and gives a number whose cumulative value matches
the probability value.

Live Demo
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)

# Choose the mean as 2 and standard deviation as 1.


y <- qnorm(x, mean = 2, sd = 1)

# Give the chart file a name.


png(file = "qnorm.png")

# Plot the graph.


plot(x,y)


# Save the file.


dev.off()

When we execute the above code, it produces the following result −
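Apart from the chart, the relationship between pnorm() and qnorm() can be checked numerically: they are inverses of each other for the same mean and sd. The sketch below uses illustrative values and also shows that probabilities of exactly 0 and 1 map to -Inf and Inf, which is why the end points of such a sequence do not appear on a plot.

```r
# pnorm() maps a value to a cumulative probability ...
p <- pnorm(3, mean = 2.5, sd = 2)

# ... and qnorm() maps that probability back to the value.
q <- qnorm(p, mean = 2.5, sd = 2)
print(c(p = p, q = q))

# Probabilities of exactly 0 and 1 correspond to infinite quantiles.
ends <- qnorm(c(0, 0.5, 1), mean = 2, sd = 1)
print(ends)

# About 68.27 percent of values lie within one standard deviation of the mean.
within_1sd <- pnorm(1) - pnorm(-1)
print(within_1sd)
```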

rnorm()

This function is used to generate random numbers whose distribution is normal. It takes the
sample size as input and generates that many random numbers. We draw a histogram to show
the distribution of the generated numbers.

Live Demo
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)

# Give the chart file a name.


png(file = "rnorm.png")

# Plot the histogram for this sample.


hist(y, main = "Normal Distribution")

# Save the file.


dev.off()


When we execute the above code, it produces the following result −

11/26/21, 12:56 PM R - Time Series Analysis

R - Time Series Analysis

A time series is a series of data points in which each data point is associated with a timestamp. A
simple example is the price of a stock in the stock market at different points of time on a given
day. Another example is the amount of rainfall in a region in different months of the year. R
provides many functions to create, manipulate and plot time series data. The data for
a time series is stored in an R object called a time-series object. It is also an R data object, like a
vector or data frame.

The time series object is created by using the ts() function.

Syntax

The basic syntax for ts() function in time series analysis is −

timeseries.object.name <- ts(data, start, end, frequency)

Following is the description of the parameters used −

data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in time series.

end specifies the end time for the last observation in time series.

frequency specifies the number of observations per unit time.

Except the parameter "data" all other parameters are optional.

Example

Consider the monthly rainfall details at a place starting from January 2012. We create an R time
series object for a period of 12 months and plot it.

Live Demo
# Get the data points in form of a R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

# Convert it to a time series object.


rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.


print(rainfall.timeseries)

# Give the chart file a name.


png(file = "rainfall.png")


# Plot a graph of the time series.


plot(rainfall.timeseries)

# Save the file.


dev.off()

When we execute the above code, it produces the following result and chart −

Jan Feb Mar Apr May Jun Jul Aug Sep


2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0

The Time series chart −

Different Time Intervals

The value of the frequency parameter in the ts() function decides the time intervals at which the
data points are measured. A value of 12 indicates that the time series is for 12 months. Other
values and their meanings are as below −

frequency = 12 pegs the data points for every month of a year.

frequency = 4 pegs the data points for every quarter of a year.

frequency = 6 pegs the data points for every 10 minutes of an hour.

frequency = 24*6 pegs the data points for every 10 minutes of a day.
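As a small sketch of a frequency other than 12, the following builds a quarterly series from hypothetical values (the numbers are invented for illustration) and extracts one year of it with window():

```r
# Eight hypothetical quarterly observations starting in the first quarter of 2012.
quarterly <- ts(c(12, 15, 11, 18, 13, 16, 12, 19),
                start = c(2012, 1), frequency = 4)

# With frequency = 4 the series prints in a year by quarter layout.
print(quarterly)

print(frequency(quarterly))   # observations per year
print(end(quarterly))         # year and quarter of the last observation

# Keep only the observations from the first quarter of 2013 onwards.
print(window(quarterly, start = c(2013, 1)))
```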

Multiple Time Series

We can plot multiple time series in one chart by combining both the series into a matrix.

Live Demo
# Get the data points in form of a R vector.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <-
c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)

# Convert them to a matrix.


combined.rainfall <- matrix(c(rainfall1,rainfall2),nrow = 12)

# Convert it to a time series object.


rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.


print(rainfall.timeseries)

# Give the chart file a name.


png(file = "rainfall_combined.png")

# Plot a graph of the time series.


plot(rainfall.timeseries, main = "Multiple Time Series")

# Save the file.


dev.off()

When we execute the above code, it produces the following result and chart −

Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6


Nov 2012 882.8 1106.7


Dec 2012 1071.0 1337.8

The Multiple Time series chart −

11/26/21, 12:47 PM R - Packages

R - Packages

R packages are collections of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose. When we start the R console, only the default packages are available. Other
packages which are already installed have to be loaded explicitly to be used by the R program
that is going to use them.

All the packages available in R language are listed at R Packages.

Below is a list of commands to be used to check, verify and use the R packages.

Check Available R Packages


Get library locations containing R packages

Live Demo
.libPaths()

When we execute the above code, it produces the following result. It may vary depending on the
local settings of your pc.

[2] "C:/Program Files/R/R-3.2.2/library"

Get the list of all the packages installed

library()

When we execute the above code, it produces the following result. It may vary depending on the
local settings of your pc.

Packages in library ‘C:/Program Files/R/R-3.2.2/library’:

base         The R Base Package
boot         Bootstrap Functions (Originally by Angelo Canty
             for S)
class        Functions for Classification
cluster      "Finding Groups in Data": Cluster Analysis
             Extended Rousseeuw et al.
codetools    Code Analysis Tools for R
compiler     The R Compiler Package
datasets     The R Datasets Package
foreign      Read Data Stored by 'Minitab', 'S', 'SAS',
             'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics     The R Graphics Package
grDevices    The R Graphics Devices and Support for Colours
             and Fonts
grid         The Grid Graphics Package
KernSmooth   Functions for Kernel Smoothing Supporting Wand
             & Jones (1995)
lattice      Trellis Graphics for R
MASS         Support Functions and Datasets for Venables and
             Ripley's MASS
Matrix       Sparse and Dense Matrix Classes and Methods
methods      Formal Methods and Classes
mgcv         Mixed GAM Computation Vehicle with GCV/AIC/REML
             Smoothness Estimation
nlme         Linear and Nonlinear Mixed Effects Models
nnet         Feed-Forward Neural Networks and Multinomial
             Log-Linear Models
parallel     Support for Parallel computation in R
rpart        Recursive Partitioning and Regression Trees
spatial      Functions for Kriging and Point Pattern
             Analysis
splines      Regression Spline Functions and Classes
stats        The R Stats Package
stats4       Statistical Functions using S4 Classes
survival     Survival Analysis
tcltk        Tcl/Tk Interface
tools        Tools for Package Development
utils        The R Utils Package

Get all packages currently loaded in the R environment

search()

When we execute the above code, it produces the following result. It may vary depending on the
local settings of your pc.

[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"
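
The vector returned by search() can also be tested in code. The following is a small illustrative sketch, not part of the original tutorial; the helper name is_attached is hypothetical.

```r
# Hypothetical helper: TRUE when the given package is on the search path.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

# "base" is always attached; "XML" is attached only after library("XML").
print(is_attached("base"))
print(is_attached("XML"))
```

Entries in search() carry the prefix "package:", which is why the helper adds it before comparing.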

Install a New Package


There are two ways to add new R packages. One is installing directly from the CRAN directory
and another is downloading the package to your local system and installing it manually.

Install directly from CRAN


The following command gets the package directly from the CRAN webpage and installs it
in the R environment. You may be prompted to choose the nearest mirror. Choose the
one appropriate to your location.

install.packages("Package Name")

# Install the package named "XML".


install.packages("XML")
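
In scripts it is often convenient to install a package only when it is missing. The sketch below is one possible way to do that, under the assumption that a CRAN mirror is reachable when an installation is actually needed; the helper name install_if_missing is hypothetical.

```r
# Hypothetical helper: install a package from CRAN only when it is not yet installed.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    message(pkg, " is already installed")
    return(invisible(FALSE))
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" ships with R, so no download is attempted here.
install_if_missing("stats")
```

requireNamespace() returns FALSE instead of stopping when the package is absent, which makes it suitable for this kind of check.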

Install package manually

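Go to the link R Packages to download the package needed. Save the package file in a suitable
location in the local system. The example here uses a .zip file, which is the Windows binary
format; source packages are distributed as .tar.gz archives.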

Now you can run the following command to install this package in the R environment.

install.packages(file_name_with_path, repos = NULL, type = "source")

# Install the package named "XML"


install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")

Load Package to Library


Before a package can be used in the code, it must be loaded into the current R environment. This
also applies to a package that was installed earlier but is not yet available in the current
environment.
A package is loaded using the following command −

library("package Name", lib.loc = "path to library")

# Load the package named "XML"


library("XML")
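
As a further illustration of loading, the sketch below uses the tools package, which ships with R, so it runs without any installation. The file names are made up for the example.

```r
# Attach a package that ships with R and call one of its functions.
library("tools")
ext <- file_ext("rainfall_combined.png")
print(ext)

# A function can also be called without attaching its package.
ext2 <- tools::file_ext("data.csv")
print(ext2)

# requireNamespace() reports whether a package is installed without attaching it.
has_xml <- requireNamespace("XML", quietly = TRUE)
print(has_xml)
```

The pkg::fun form is useful when only one or two functions of a package are needed.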

https://www.tutorialspoint.com/r/r_packages.htm 3/3
R Reference Card
by Tom Short, EPRI PEAC, [email protected] 2004-11-07
Granted to the public domain. See www.Rpad.org for the source and latest version. Includes
material from R for Beginners by Emmanuel Paradis (with permission).

Getting help
Most R functions have online documentation.
help(topic) documentation on topic
?topic id.
help.search("topic") search the help system
apropos("topic") the names of all objects in the search list matching the regular expression "topic"
help.start() start the HTML version of help
str(a) display the internal *str*ucture of an R object
summary(a) gives a "summary" of a, usually a statistical summary but it is generic meaning it
    has different operations for different classes of a
ls() show objects in the search path; specify pat="pat" to search on a pattern
ls.str() str() for each variable in the search path
dir() show files in the current directory
methods(a) shows S3 methods of a
methods(class=class(a)) lists all the methods to handle objects of class a

Input and output
load() load the datasets written with save
data(x) loads specified data sets
library(x) load add-on packages
read.table(file) reads a file in table format and creates a data frame from it; the default
    separator sep="" is any whitespace; use header=TRUE to read the first line as a header of
    column names; use as.is=TRUE to prevent character vectors from being converted to factors;
    use comment.char="" to prevent "#" from being interpreted as a comment; use skip=n to skip
    n lines before reading data; see the help for options on row naming, NA treatment, and others
read.csv("filename",header=TRUE) id. but with defaults set for reading comma-delimited files
read.delim("filename",header=TRUE) id. but with defaults set for reading tab-delimited files
read.fwf(file,widths,header=FALSE,sep="\t",as.is=FALSE) read a table of fixed width formatted
    data into a 'data.frame'; widths is an integer vector, giving the widths of the fixed-width fields
save(file,...) saves the specified objects (...) in the XDR platform-independent binary format
save.image(file) saves all objects
cat(..., file="", sep=" ") prints the arguments after coercing to character; sep is the
    character separator between arguments
print(a, ...) prints its arguments; generic, meaning it can have different methods for
    different objects
format(x,...) format an R object for pretty printing
write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ") prints x after converting to a
    data frame; if quote is TRUE, character or factor columns are surrounded by quotes ("); sep
    is the field separator; eol is the end-of-line separator; na is the string for missing
    values; use col.names=NA to add a blank column header to get the column headers aligned
    correctly for spreadsheet input
sink(file) output to file, until sink()
Most of the I/O functions have a file argument. This can often be a character string naming a
file or a connection. file="" means the standard input or output. Connections can include
files, pipes, zipped files, and R variables.
On windows, the file connection can also be used with description = "clipboard". To read a
table copied from Excel, use
x <- read.delim("clipboard")
To write a table to the clipboard for Excel, use
write.table(x,"clipboard",sep="\t",col.names=NA)
For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle. See packages
XML, hdf5, netCDF for reading other file formats.

Data creation
c(...) generic function to combine arguments with the default forming a vector; with
    recursive=TRUE descends through lists combining all elements into one vector
from:to generates a sequence; ":" has operator priority; 1:4 + 1 is "2,3,4,5"
seq(from,to) generates a sequence; by= specifies increment; length= specifies desired length
seq(along=x) generates 1, 2, ..., length(along); useful for for loops
rep(x,times) replicate x times; use each= to repeat "each" element of x each times;
    rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
data.frame(...) create a data frame of the named or unnamed arguments;
    data.frame(v=1:4,ch=c("a","B","c","d"),n=10); shorter vectors are recycled to the length
    of the longest
list(...) create a list of the named or unnamed arguments; list(a=c(1,2),b="hi",c=3i);
array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle
    if x is not long enough
matrix(x,nrow=,ncol=) matrix; elements of x recycle
factor(x,levels=) encodes a vector x as a factor
gl(n,k,length=n*k,labels=1:n) generate levels (factors) by specifying the pattern of their
    levels; n is the number of levels, and k is the number of replications
expand.grid() a data frame from all combinations of the supplied vectors or factors
rbind(...) combine arguments by rows for matrices, data frames, and others
cbind(...) id. by columns

Slicing and extracting data
Indexing vectors
x[n] nth element
x[-n] all but the nth element
x[1:n] first n elements
x[-(1:n)] elements from n+1 to the end
x[c(1,4,2)] specific elements
x["name"] element named "name"
x[x > 3] all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a","and","the")] elements in the given set
Indexing lists
x[n] list with elements n
x[[n]] nth element of the list
x[["name"]] element of the list named "name"
x$name id.
Indexing matrices
x[i,j] element at row i, column j
x[i,] row i
x[,j] column j
x[,c(1,3)] columns 1 and 3
x["name",] row named "name"
Indexing data frames (matrix indexing plus the following)
x[["name"]] column named "name"
x$name id.

Variable conversion
as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x),
    ... convert type; for a complete list, use methods(as)

Variable information
is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x),
    is.character(x), ... test for type; for a complete list, use methods(is)
length(x) number of elements in x
dim(x) Retrieve or set the dimension of an object; dim(x) <- c(3,2)
dimnames(x) Retrieve or set the dimension names of an object
nrow(x) number of rows; NROW(x) is the same but treats a vector as a one-column matrix
ncol(x) and NCOL(x) id. for columns
class(x) get or set the class of x; class(x) <- "myclass"
unclass(x) remove the class attribute of x
attr(x,which) get or set the attribute which of x
attributes(obj) get or set the list of attributes of obj

Data selection and manipulation
which.max(x) returns the index of the greatest element of x
which.min(x) returns the index of the smallest element of x
rev(x) reverses the elements of x
sort(x) sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
cut(x,breaks) divides x into intervals (factors); breaks is the number of cut intervals or a
    vector of cut points
match(x, y) returns a vector of the same length as x with the elements of x which are in y
    (NA otherwise)
which(x == a) returns a vector of the indices of x if the comparison operation is true (TRUE),
    in this example the values of i for which x[i] == a (the argument of this function must be
    a variable of mode logical)
choose(n, k) computes the combinations of k events among n repetitions = n!/[(n − k)!k!]
na.omit(x) suppresses the observations with missing data (NA) (suppresses the corresponding
    line if x is a matrix or a data frame)
na.fail(x) returns an error message if x contains at least one NA
unique(x) if x is a vector or a data frame, returns a similar object but with the duplicate
    elements suppressed
table(x) returns a table with the numbers of the different values of x (typically for
    integers or factors)
subset(x, ...) returns a selection of x with respect to criteria (..., typically comparisons:
    x$V1 < 10); if x is a data frame, the option select gives the variables to be kept or
    dropped using a minus sign
sample(x, size) resample randomly and without replacement size elements in the vector x, the
    option replace = TRUE allows to resample with replacement
prop.table(x,margin=) table entries as fraction of marginal table

Math
sin,cos,tan,asin,acos,atan,atan2,log,log10,exp
max(x) maximum of the elements of x
min(x) minimum of the elements of x
range(x) id. then c(min(x), max(x))
sum(x) sum of the elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
mean(x) mean of the elements of x
median(x) median of the elements of x
quantile(x,probs=) sample quantiles corresponding to the given probabilities (defaults to
    0,.25,.5,.75,1)
weighted.mean(x, w) mean of x with weights w
rank(x) ranks of the elements of x
var(x) or cov(x) variance of the elements of x (calculated on n − 1); if x is a matrix or a
    data frame, the variance-covariance matrix is calculated
sd(x) standard deviation of x
cor(x) correlation matrix of x if it is a matrix or a data frame (1 if x is a vector)
var(x, y) or cov(x, y) covariance between x and y, or between the columns of x and those of y
    if they are matrices or data frames
cor(x, y) linear correlation between x and y, or correlation matrix if they are matrices or
    data frames
round(x, n) rounds the elements of x to n decimals
log(x, base) computes the logarithm of x with base base
scale(x) if x is a matrix, centers and reduces the data; to center only use the option
    scale=FALSE, to reduce only center=FALSE (by default center=TRUE, scale=TRUE)
pmin(x,y,...) a vector which ith element is the minimum of x[i], y[i], ...
pmax(x,y,...) id. for the maximum
cumsum(x) a vector which ith element is the sum from x[1] to x[i]
cumprod(x) id. for the product
cummin(x) id. for the minimum
cummax(x) id. for the maximum
union(x,y), intersect(x,y), setdiff(x,y), setequal(x,y), is.element(el,set) "set" functions
Re(x) real part of a complex number
Im(x) imaginary part
Mod(x) modulus; abs(x) is the same
Arg(x) angle in radians of the complex number
Conj(x) complex conjugate
convolve(x,y) compute the several kinds of convolutions of two sequences
fft(x) Fast Fourier Transform of an array
mvfft(x) FFT of each column of a matrix
filter(x,filter) applies linear filtering to a univariate time series or to each series
    separately of a multivariate time series
Many math functions have a logical parameter na.rm=FALSE to specify missing data (NA) removal.

Matrices
t(x) transpose
diag(x) diagonal
%*% matrix multiplication
solve(a,b) solves a %*% x = b for x
solve(a) matrix inverse of a
rowsum(x) sum of rows for a matrix-like object; rowSums(x) is a faster version
colsum(x), colSums(x) id. for columns
rowMeans(x) fast version of row means
colMeans(x) id. for columns

Advanced data processing
apply(X,INDEX,FUN=) a vector or array or list of values obtained by applying a function FUN to
    margins (INDEX) of X
lapply(X,FUN) apply FUN to each element of the list X
tapply(X,INDEX,FUN=) apply FUN to each cell of a ragged array given by X with indexes INDEX
by(data,INDEX,FUN) apply FUN to data frame data subsetted by INDEX
merge(a,b) merge two data frames by common columns or row names
xtabs(a~b,data=x) a contingency table from cross-classifying factors
aggregate(x,by,FUN) splits the data frame x into subsets, computes summary statistics for
    each, and returns the result in a convenient form; by is a list of grouping elements, each
    as long as the variables in x
stack(x, ...) transform data available as separate columns in a data frame or list into a
    single column
unstack(x, ...) inverse of stack()
reshape(x, ...) reshapes a data frame between 'wide' format with repeated measurements in
    separate columns of the same record and 'long' format with the repeated measurements in
    separate records; use (direction="wide") or (direction="long")

Strings
paste(...) concatenate vectors after converting to character; sep= is the string to separate
    terms (a single space is the default); collapse= is an optional string to separate
    "collapsed" results
substr(x,start,stop) substrings in a character vector; can also assign, as
    substr(x, start, stop) <- value
strsplit(x,split) split x according to the substring split
grep(pattern,x) searches for matches to pattern within x; see ?regex
gsub(pattern,replacement,x) replacement of matches determined by regular expression matching;
    sub() is the same but only replaces the first occurrence.
tolower(x) convert to lowercase
toupper(x) convert to uppercase
match(x,table) a vector of the positions of first matches for the elements of x among table
x %in% table id. but returns a logical vector
pmatch(x,table) partial matches for the elements of x among table
nchar(x) number of characters

Dates and Times
The class Date has dates without times. POSIXct has dates and times, including time zones.
Comparisons (e.g. >), seq(), and difftime() are useful. Date also allows + and −.
?DateTimeClasses gives more information. See also package chron.
as.Date(s) and as.POSIXct(s) convert to the respective class; format(dt) converts to a string
representation. The default string format is "2001-02-21". These accept a second argument to
specify a format for conversion. Some common formats are:
%a, %A Abbreviated and full weekday name.
%b, %B Abbreviated and full month name.
%d Day of the month (01–31).
%H Hours (00–23).
%I Hours (01–12).
%j Day of year (001–366).
%m Month (01–12).
%M Minute (00–59).
%p AM/PM indicator.
%S Second as decimal number (00–61).
%U Week (00–53); the first Sunday as day 1 of week 1.
%w Weekday (0–6, Sunday is 0).
%W Week (00–53); the first Monday as day 1 of week 1.
%y Year without century (00–99). Don't use.
%Y Year with century.
%z (output only.) Offset from Greenwich; -0800 is 8 hours west of.
%Z (output only.) Time zone as a character string (empty if not available).
Where leading zeros are shown they will be used on output but are optional on input. See
?strftime.

Plotting
plot(x) plot of the values of x (on the y-axis) ordered on the x-axis
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) histogram of the values of x; use horiz=TRUE for horizontal bars
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and
    column-by-column)
pie(x) circular pie-chart
boxplot(x) "box-and-whiskers" plot
sunflowerplot(x, y) id. than plot() but the points with similar coordinates are drawn as
    flowers which petal number represents the number of points
stripplot(x) plot of the values of x on a line (an alternative to boxplot() for small sample
    sizes)
coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z
interaction.plot (f1, f2, y) if f1 and f2 are factors, plots the means of y (on the y-axis)
    with respect to the values of f1 (on the x-axis) and of f2 (different curves); the option
    fun allows to choose the summary statistic of y (by default fun=mean)
matplot(x,y) bivariate plot of the first column of x vs. the first one of y, the second one of
    x vs. the second one of y, etc.
fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous
    variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix
    with dim=c(2, 2) if k = 1)
assocplot(x) Cohen–Friendly graph showing the deviations from independence of rows and columns
    in a two dimensional contingency table
mosaicplot(x) 'mosaic' graph of the residuals from a log-linear regression of a contingency
    table
pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the
    columns of x
plot.ts(x) if x is an object of class "ts", plot of x with respect to time, x may be
    multivariate but the series must have the same frequency and dates
ts.plot(x) id. but if x is multivariate the series may have different dates and must have the
    same frequency
qqnorm(x) quantiles of x with respect to the values expected under a normal law
qqplot(x, y) quantiles of y with respect to the quantiles of x
contour(x, y, z) contour plot (data are interpolated to draw the curves), x and y must be
    vectors and z must be a matrix so that dim(z)=c(length(x), length(y)) (x and y may be
    omitted)
filled.contour(x, y, z) id. but the areas between the contours are coloured, and a legend of
    the colours is drawn as well
image(x, y, z) id. but with colours (actual data are plotted)
persp(x, y, z) id. but in perspective (actual data are plotted)
stars(x) if x is a matrix or a data frame, draws a graph with segments or a star where each
    row of x is represented by a star and the columns are the lengths of the segments
symbols(x, y, ...) draws, at the coordinates given by x and y, symbols (circles, squares,
    rectangles, stars, thermometres or "boxplots") which sizes, colours ... are specified by
    supplementary arguments
termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj)
The following parameters are common to many plotting functions:
add=FALSE if TRUE superposes the plot on the previous one (if it exists)
axes=TRUE if FALSE does not draw the axes and the box
type="p" specifies the type of plot, "p": points, "l": lines, "b": points connected by lines,
    "o": id. but the lines are over the points, "h": vertical lines, "s": steps, the data are
    represented by the top of the vertical lines, "S": id. but the data are represented by the
    bottom of the vertical lines
xlim=, ylim= specifies the lower and upper limits of the axes, for example with xlim=c(1, 10)
    or xlim=range(x)
xlab=, ylab= annotates the axes, must be variables of mode character
main= main title, must be a variable of mode character
sub= sub-title (written in a smaller font)

Low-level plotting commands
points(x, y) adds points (the option type= can be used)
lines(x, y) id. but with lines
text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is:
    plot(x, y, type="n"); text(x, y, names)
mtext(text, side=3, line=0, ...) adds text given by text in the margin specified by side (see
    axis() below); line specifies the line from the plotting area
segments(x0, y0, x1, y1) draws lines from points (x0,y0) to points (x1,y1)
arrows(x0, y0, x1, y1, angle= 30, code=2) id. with arrows at points (x0,y0) if code=1, at
    points (x1,y1) if code=2, or both if code=3; angle controls the angle from the shaft of the
    arrow to the edge of the arrow head
abline(a,b) draws a line of slope b and intercept a
abline(h=y) draws a horizontal line at ordinate y
abline(v=x) draws a vertical line at abscissa x
abline(lm.obj) draws the regression line given by lm.obj
rect(x1, y1, x2, y2) draws a rectangle which left, right, bottom, and top limits are x1, x2,
    y1, and y2, respectively
polygon(x, y) draws a polygon linking the points with coordinates given by x and y
legend(x, y, legend) adds the legend at the point (x,y) with the symbols given by legend
title() adds a title and optionally a sub-title
axis(side, vect) adds an axis at the bottom (side=1), on the left (2), at the top (3), or on
    the right (4); vect (optional) gives the abscissa (or ordinates) where tick-marks are drawn
rug(x) draws the data x on the x-axis as small vertical lines
locator(n, type="n", ...) returns the coordinates (x, y) after the user has clicked n times on
    the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to
    optional graphic parameters (...); by default nothing is drawn (type="n")

Graphical parameters
These can be set globally with par(...); many can be passed as parameters to plotting commands.
adj controls text justification (0 left-justified, 0.5 centred, 1 right-justified)
bg specifies the colour of the background (ex. : bg="red", bg="blue", ... the list of the 657
    available colours is displayed with colors())
bty controls the type of box drawn around the plot, allowed values are: "o", "l", "7", "c",
    "u" or "]" (the box looks like the corresponding character); if bty="n" the box is not drawn
cex a value controlling the size of texts and symbols with respect to the default; the
    following parameters have the same control for numbers on the axes, cex.axis, the axis
    labels, cex.lab, the title, cex.main, and the sub-title, cex.sub
col controls the color of symbols and lines; use color names: "red", "blue" see colors() or as
    "#RRGGBB"; see rgb(), hsv(), gray(), and rainbow(); as for cex there are: col.axis,
    col.lab, col.main, col.sub
font an integer which controls the style of text (1: normal, 2: bold, 3: italics, 4: bold
    italics); as for cex there are: font.axis, font.lab, font.main, font.sub
las an integer which controls the orientation of the axis labels (0: parallel to the axes,
    1: horizontal, 2: perpendicular to the axes, 3: vertical)
lty controls the type of lines, can be an integer or string (1: "solid", 2: "dashed",
    3: "dotted", 4: "dotdash", 5: "longdash", 6: "twodash"), or a string of up to eight
    characters (between "0" and "9") which specifies alternatively the length, in points or
    pixels, of the drawn elements and the blanks, for example lty="44" will have the same
    effect as lty=2
lwd a numeric which controls the width of lines, default 1
mar a vector of 4 numeric values which control the space between the axes and the border of
    the graph of the form c(bottom, left, top, right), the default values are
    c(5.1, 4.1, 4.1, 2.1)
mfcol a vector of the form c(nr,nc) which partitions the graphic window as a matrix of nr
    lines and nc columns, the plots are then drawn in columns
mfrow id. but the plots are drawn by row
pch controls the type of symbol, either an integer between 1 and 25, or any single character
    within "" (the original card shows a chart of symbols 1 to 25 and of character symbols
    such as "*", ".", "X", "a", "?")
ps an integer which controls the size in points of texts and symbols
pty a character which specifies the type of the plotting region, "s": square, "m": maximal
tck a value which specifies the length of tick-marks on the axes as a fraction of the smallest
    of the width or height of the plot; if tck=1 a grid is drawn
tcl a value which specifies the length of tick-marks on the axes as a fraction of the height
    of a line of text (by default tcl=-0.5)
xaxt if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...))
yaxt if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...))

Lattice (Trellis) graphics
xyplot(y~x) bivariate plots (with many functionalities)
barchart(y~x) histogram of the values of y with respect to those of x
dotplot(y~x) Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x) density functions plot
histogram(~x) histogram of the frequencies of x
bwplot(y~x) "box-and-whiskers" plot
qqmath(~x) quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x) single dimension plot, x must be numeric, y may be a factor
qq(y~x) quantiles to compare two distributions, x must be numeric, y may be numeric,
    character, or factor but must have two 'levels'
splom(~x) matrix of bivariate plots
parallel(~x) parallel coordinates plot
levelplot(z~x*y|g1*g2) coloured plot of the values of z at the coordinates given by x and y
    (x, y and z are all of the same length)
wireframe(z~x*y|g1*g2) 3d surface plot
cloud(z~x*y|g1*g2) 3d scatter plot
In the normal Lattice formula, y~x|g1*g2 has combinations of optional conditioning variables
g1 and g2 plotted on separate panels. Lattice functions take many of the same arguments as
base graphics plus also data= the data frame for the formula variables and subset= for
subsetting. Use panel= to define a custom panel function (see apropos("panel") and ?llines).
Lattice functions return an object of class trellis and have to be print-ed to produce the
graph. Use print(xyplot(...)) inside functions where automatic printing doesn't work. Use
lattice.theme and lset to change Lattice defaults.

Optimization and model fitting
optim(par, fn, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN")) general-purpose
    optimization; par is initial values, fn is function to optimize (normally minimize)
nlm(f,p) minimize function f using a Newton-type algorithm with starting values p
lm(formula) fit linear models; formula is typically of the form response ~ termA + termB +
    ...; use I(x*y) + I(x^2) for terms made of nonlinear components
glm(formula,family=) fit generalized linear models, specified by giving a symbolic description
    of the linear predictor and a description of the error distribution; family is a
    description of the error distribution and link function to be used in the model; see ?family
nls(formula) nonlinear least-squares estimates of the nonlinear model parameters
approx(x,y=) linearly interpolate given data points; x can be an xy plotting structure
spline(x,y=) cubic spline interpolation
loess(formula) fit a polynomial surface using local fitting
Many of the formula-based modeling functions have several common arguments: data= the data
frame for the formula variables, subset= a subset of variables used in the fit, na.action=
action for missing values: "na.fail", "na.omit", or a function. The following generics often
apply to model fitting functions:
predict(fit,...) predictions from fit based on input data
df.residual(fit) returns the number of residual degrees of freedom
coef(fit) returns the estimated coefficients (sometimes with their standard-errors)
residuals(fit) returns the residuals
deviance(fit) returns the deviance
fitted(fit) returns the fitted values
logLik(fit) computes the logarithm of the likelihood and the number of parameters
AIC(fit) computes the Akaike information criterion or AIC

Statistics
aov(formula) analysis of variance model
anova(fit,...) analysis of variance (or deviance) tables for one or more fitted model objects
density(x) kernel density estimates of x
binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test(), ... use
    help.search("test")

Distributions
rnorm(n, mean=0, sd=1) Gaussian (normal)
rexp(n, rate=1) exponential
rgamma(n, shape, scale=1) gamma
rpois(n, lambda) Poisson
rweibull(n, shape, scale=1) Weibull
rcauchy(n, location=0, scale=1) Cauchy
rbeta(n, shape1, shape2) beta
rt(n, df) 'Student' (t)
rf(n, df1, df2) Fisher–Snedecor (F)
rchisq(n, df) Pearson (χ2)
rbinom(n, size, prob) binomial
rgeom(n, prob) geometric
rhyper(nn, m, n, k) hypergeometric
rlogis(n, location=0, scale=1) logistic
rlnorm(n, meanlog=0, sdlog=1) lognormal
rnbinom(n, size, prob) negative binomial
runif(n, min=0, max=1) uniform
rwilcox(nn, m, n), rsignrank(nn, n) Wilcoxon's statistics
All these functions can be used by replacing the letter r with d, p or q to get, respectively,
the probability density (dfunc(x, ...)), the cumulative probability density (pfunc(x, ...)),
and the value of quantile (qfunc(p, ...), with 0 < p < 1).

Programming
function( arglist ) expr function definition
return(value)
if(cond) expr
if(cond) cons.expr else alt.expr
for(var in seq) expr
while(cond) expr
repeat expr
break
next
Use braces {} around statements
ifelse(test, yes, no) a value with the same shape as test filled with elements from either yes
    or no
do.call(funname, args) executes a function call from the name of the function and a list of
    arguments to be passed to it

Data frame with an example.
A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.
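
The characteristics above can be illustrated with a small sketch. The column names and values below are invented for illustration and do not come from the original notes.

```r
# Build a data frame: every column has the same number of items (4).
emp.data <- data.frame(
  emp_id   = 1:4,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan"),
  salary   = c(623.3, 515.2, 611.0, 729.0),
  stringsAsFactors = FALSE
)

print(emp.data)        # the whole table
str(emp.data)          # column types: integer, character, numeric
print(dim(emp.data))   # number of rows and columns

# Extract one column, then the rows that satisfy a condition.
print(emp.data$salary)
high.paid <- emp.data[emp.data$salary > 600, ]
print(high.paid)
```

Each column holds one variable and each row holds one observation, which is why row filtering uses the position before the comma.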

Operators in R.
R's binary and logical operators will look very familiar to programmers. Note that binary operators
work on vectors and matrices as well as scalars.
Arithmetic Operators
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y) 5%%2 is 1
x %/% y integer division 5%/%2 is 2
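
A short sketch of the arithmetic operators applied element by element to vectors; the vectors a and b are made-up illustration values.

```r
a <- c(2, 5.5, 6)
b <- c(8, 3, 4)

sum.ab  <- a + b     # element-wise addition
mod.ab  <- a %% b    # element-wise remainder
quot.ab <- a %/% b   # element-wise integer division
pow     <- 2 ^ 3     # exponentiation; 2 ** 3 gives the same value

print(sum.ab)
print(mod.ab)
print(quot.ab)
print(pow)
```

Because the operators are vectorised, no explicit loop is needed to process each element.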
Logical Operators
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x|y x OR y
x&y x AND y
isTRUE(x) test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x>8
F F F F F F F F T T
x<5
T T T T F F F F F F
x>8|x<5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
3. List in R

Loops Structures in R.
In R programming, a control structure is needed to run a block of code multiple times. Loops are
among the most fundamental and powerful programming concepts. A loop is a control
statement that allows a statement or a set of statements to be executed repeatedly. The word ‘looping’
means cycling or iterating.
A loop repeatedly tests a condition. Each time the condition calls for action, the loop body is
executed and the condition is then tested again; this continues until the condition no longer holds.
Each pass through the loop is known as an iteration of the loop. There are two components of a loop:
the control statement and the loop body. The control statement controls the execution of
statements depending on the condition, and the loop body consists of the set of statements to be
executed.
There are three types of loop in R programming:
For Loop
While Loop
Repeat Loop
For Loop in R
A for loop is a control statement that runs a statement or a set of statements multiple times.
It is commonly used to iterate over the items of a sequence. It is an entry-controlled loop: the test
(whether any items remain in the sequence) is made first and then the body of the loop is executed,
so the loop body is not executed at all if the sequence is empty.
R – For loop Syntax:
for (value in sequence)
{
statement
}
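The syntax above can be illustrated with a minimal sketch. The fruits vector is hypothetical example data.

```r
# Iterate over the items of a character vector
fruits <- c("apple", "banana", "cherry")
for (f in fruits) {
  print(f)
}

# Accumulate a sum over the sequence 1, 2, ..., 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15
```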
For Loop Flow Diagram:

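While Loop in R
A while loop is an entry-controlled loop: the condition is tested before each iteration, and the
body runs repeatedly as long as the condition is TRUE.
R – While loop Syntax: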
while ( condition )
{
statement
}
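A minimal sketch of the while syntax, assuming a simple counter (the variable name count is arbitrary):

```r
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1   # without this update the condition would never become FALSE
}
```

The condition is checked before every pass, so the loop stops as soon as count exceeds 5.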
While loop Flow Diagram:

Repeat Loop in R
It is a simple loop that will run the same statement or a group of statements
repeatedly until a stop condition is encountered. A repeat loop has no built-in
condition to terminate the loop, so the programmer must explicitly place a
condition within the loop’s body and use a break statement to
terminate the loop. If no such condition is present in the body of the repeat loop, it
will iterate infinitely.
R – Repeat loop Syntax:
repeat
{
statement

if( condition )
{
break
}
}
Repeat loop Flow Diagram:
To terminate the repeat loop, we use a jump statement: the break keyword.
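A minimal sketch of a repeat loop that stops itself with break (the limit of 3 is an arbitrary choice):

```r
i <- 0
repeat {
  i <- i + 1
  print(i)
  if (i >= 3) {
    break      # leave the loop once the stop condition is met
  }
}
```

Removing the if block would leave the loop with no exit, so it would run forever.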

If-else Construct in R
R if statement
The syntax of if statement is:
if (test_expression) {
statement
}
If the test_expression is TRUE, the statement gets executed. But if it’s FALSE, nothing
happens.
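Here, test_expression should evaluate to a single logical or numeric value. Since R 4.2.0, a
condition of length greater than one is an error; earlier versions used only the first element
and issued a warning.
In the case of a numeric value, zero is taken as FALSE and any other number as TRUE.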
Flowchart of if statement

Example: if statement

x <- 5
if(x > 0){
print("Positive number")
}
Output
[1] "Positive number"

if…else statement
The syntax of if…else statement is:
if (test_expression) {
statement1
} else {
statement2
}
The else part is optional and is only evaluated if test_expression is FALSE.
It is important to note that else must be on the same line as the closing brace of the
if statement.
Flowchart of if…else statement

Example of if…else statement

x <- -5
if(x >= 0){
print("Non-negative number")
} else {
print("Negative number")
}
Output
[1] "Negative number"
The above conditional can also be written in a single line as follows.
if(x >= 0) print("Non-negative number") else print("Negative number")
Because if…else is an expression that returns a value, R also allows constructs such as the one shown below.
> x <- -5
> y <- if(x > 0) 5 else 6
> y
[1] 6

Box-plot in R.
The boxplot() function shows how the distribution of a numerical variable y differs across the
unique levels of a second variable, x. To be effective, this second variable should not have too many
unique levels (e.g., 10 or fewer is good; many more than this makes the plot difficult to interpret).
The boxplot() function also has a number of optional parameters; three of them help
produce a more informative plot:
varwidth draws variable-width boxes that reflect the different sizes of the data subsets.
log allows for log-transformed y-values.
las allows for more readable axis labels.
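The three parameters can be combined in one call. This is a minimal sketch, assuming the built-in mtcars data set with mpg as the numerical variable and cyl as the grouping variable; the notes do not name a data set, so this choice is only for illustration.

```r
# mpg split by number of cylinders (three groups: 4, 6 and 8)
bp <- boxplot(mpg ~ cyl, data = mtcars,
              varwidth = TRUE,   # box widths reflect group sizes
              log = "y",         # log-transformed y axis (mpg is strictly positive)
              las = 1,           # axis labels drawn horizontally
              xlab = "Cylinders", ylab = "Miles per gallon")

bp$n       # number of observations in each group
bp$stats   # per group: lower whisker, lower quartile, median, upper quartile, upper whisker
```

boxplot() returns its summary statistics invisibly, so storing the result in bp makes the numbers behind the plot available for inspection.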
When should you use a Box Plot?
When you have a continuous variable, split by a categorical variable.
When you want to compare the distributions of the continuous variable for each category.
Histogram vs. Box Plot
A Box Plot summarizes the same distribution that a Histogram shows. The example used here is the
age at which monarchs started ruling. The line in the middle of the box shows the median of
the distribution. That is, half the monarchs started ruling before this age, and half after this age.

The box in the Box Plot extends from the lower quartile to the upper quartile. The lower quartile is
the point where one-quarter of the values are below it. That is, one-quarter of the monarchs started
ruling before this age, and three-quarters after it. Likewise, the upper quartile is the age below
which three-quarters of the monarchs started ruling. The difference between the upper quartile
and the lower quartile is called the inter-quartile range.

Sampling and its steps.


A sample is a part of the total population. It can be an individual element or a group of elements
selected from the population. Although it is a subset, it is representative of the population and
suitable for research in terms of cost, convenience, and time. The sample group can be selected
based on a probability or a non-probability approach. A sample usually consists of various units of
the population. The size of the sample is represented by “n”.
A good sample is one which satisfies all or some of the following conditions:
Representativeness: When a sampling method is adopted by the researcher, the basic assumption
is that the samples selected from the population are the best representatives of the population
under study. Thus good samples are those that accurately represent the population. Probability
sampling techniques yield representative samples.
Steps in Sampling Process
An operational sampling process can be divided into seven steps as given below:
Defining the target population.
Specifying the sampling frame.
Specifying the sampling unit.
Selection of the sampling method.
Determination of sample size.
Specifying the sampling plan.
Selecting the sample.
Random Sampling and Stratified Random Sampling.
Simple random samples and stratified random samples are both statistical measurement tools. A simple
random sample is used to represent the entire data population. A stratified random sample divides the
population into smaller groups, or strata, based on shared characteristics.
The population is the total set of observations or data. A sample is a set of observations from the
population. The sampling method is the process used to pull samples from the population.
Unlike simple random samples, stratified random samples are used with populations that can be easily
broken into different subgroups or subsets. The groups are defined by certain criteria, and elements
are then chosen at random from each group in proportion to the group's size relative to the population.
This method of sampling means there will be selections from each group, with the number taken from
each based on its proportion of the entire population. The researchers must ensure the strata do not
overlap: each point in the population must belong to exactly one stratum, so that the strata are
mutually exclusive.
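Proportional stratified sampling can be sketched in base R as follows. The data set (mtcars), the stratifying variable (cyl) and the sampling fraction of 0.5 are assumptions made for illustration only.

```r
set.seed(1)      # reproducible draws
frac <- 0.5      # sampling fraction applied to every stratum (arbitrary choice)

# Split the row indices into non-overlapping strata by number of cylinders
strata <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Within each stratum, draw a random subset proportional to the stratum size
chosen <- lapply(strata, function(idx) {
  idx[sample.int(length(idx), size = ceiling(frac * length(idx)))]
})

strat_sample <- mtcars[unlist(chosen), ]
table(mtcars$cyl)         # stratum sizes in the population
table(strat_sample$cyl)   # stratum sizes in the sample
```

Because split() assigns every row to exactly one group, the strata are mutually exclusive, and every stratum contributes to the sample.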
Pie-chart in R.
A pie-chart is a representation of values as slices of a circle with different colors. The slices
are labeled and the numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a
vector input. The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating whether the slices are drawn clockwise or anticlockwise.
Example
A very simple pie-chart is created using just the input vector and labels. The below script will
create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
Scatter plot in R.

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of
two variables. One variable is plotted on the horizontal axis and the other on the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's use
the columns "wt" and "mpg" in mtcars.
input <- mtcars[,c('wt','mpg')]
print(head(input))
When we execute the above code, it produces the following result −
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles
per gallon).
# Get the input values.
input <- mtcars[,c('wt','mpg')]

# Give the chart file a name.
png(file = "scatterplot.png")

# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)

# Save the file.
dev.off()
When we execute the above code, it produces the following result −

Simple Random Sampling


Researchers can create a simple random sample using a couple of methods. With a lottery
method, each member of the population is assigned a number, after which numbers are
selected at random.
An example of a simple random sample would be the names of 25 employees being chosen
out of a hat from a company of 250 employees. In this case, the population is all 250
employees, and the sample is random because each employee has an equal chance of
being chosen. Random sampling is used in science to conduct randomized control tests or
for blinded experiments.
The example in which the names of 25 employees out of 250 are chosen out of a hat is an
example of the lottery method at work. Each of the 250 employees would be assigned a
number between 1 and 250, after which 25 of those numbers would be chosen at random.
Because individuals who make up the subset of the larger group are chosen at random,
each individual in the large population set has the same probability of being selected. This
creates, in most cases, a balanced subset that carries the greatest potential for
representing the larger group as a whole, free from any bias.
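The lottery method in the employee example can be sketched with sample(). The seed is set only so that the draw is reproducible; the variable names are arbitrary.

```r
set.seed(123)

employees <- 1:250                       # each employee is assigned a number
chosen <- sample(employees, size = 25)   # draw 25 numbers without replacement

sort(chosen)                             # the employees selected for the sample
```

sample() draws without replacement by default, so no employee can be picked twice and each has the same chance of being selected.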

Systematic Sampling.
Systematic sampling is a statistical method that researchers use to zero in on the
population they want to research. Researchers calculate the sampling interval by
dividing the entire population size by the desired sample size. Systematic sampling is an
extended implementation of probability sampling in which members of the group are
selected at regular intervals to form a sample.
Systematic sampling definition
Systematic sampling is defined as a probability sampling method where the researcher
chooses elements from a target population by selecting a random starting point and
selects sample members after a fixed ‘sampling interval.’
For example, in school, while selecting the captains of sports teams, most of our coaches asked us to
call out numbers such as 1-5 (1-n), and the students with a number chosen at random by the coach,
for instance three, would be called out to be the captains of different teams. It is a non-stressful
selection process for both the coach and the players. Every member of the population has an equal
opportunity of being selected with this sampling technique.
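The definition above (random starting point, then a fixed sampling interval) can be sketched as follows. The population size of 250 and sample size of 25 are hypothetical values chosen for illustration.

```r
set.seed(7)

N <- 250                   # population size (hypothetical)
n <- 25                    # desired sample size (hypothetical)
k <- N %/% n               # sampling interval: 250 / 25 = 10

start <- sample(1:k, 1)    # random starting point between 1 and k
selected <- seq(from = start, by = k, length.out = n)

selected                   # every k-th member, beginning at the random start
```

Only the starting point is random; once it is fixed, the rest of the sample follows from the interval.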
Simple Linear Regression.
Regression analysis is a statistical technique for analysing and comprehending the
connection between two or more variables of interest. The methodology used to do
regression analysis aids in understanding which elements are significant, which may be
ignored, and how they interact with one another.

Regression is a statistical approach used in finance, investment, and other fields to identify
the strength and type of a connection between one dependent variable (typically represented
by Y) and a sequence of other variables (known as independent variables).
Regression is essentially the "best guess" at utilising a collection of data to generate some
form of forecast. It is the process of fitting a set of points to a graph.

Regression analysis is a mathematical method for determining which factors have
an effect on an outcome. It provides answers to the following questions:

Which factors are most important?
Which of these may we disregard?
How do those factors interact with one another?
And, perhaps most significantly, how confident are we in all of these factors?
Simple linear regression

Simple linear regression describes the relationship between a dependent variable and a single
independent variable. The model describes a straight-line (linear) relation, hence the name.

The simple linear model is expressed using the following equation:

Y = a + bX + ϵ

Where:

Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ϵ – Residual (error)

The most crucial requirement of Simple Linear Regression is that the dependent variable be
continuous (real-valued). The independent variable, on the other hand, can be either
continuous or categorical.
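The model Y = a + bX + ϵ can be fitted in R with lm(). This is a minimal sketch, assuming the built-in mtcars data set with mpg as Y and wt as X; the choice of variables is only for illustration.

```r
# Fit Y = a + bX + error, with Y = mpg and X = wt
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                 # a (intercept) and b (slope)
summary(fit)$r.squared    # share of the variance in Y explained by X

# Predict Y for a new value of X (wt = 3, i.e. a 3000 lb car)
predict(fit, newdata = data.frame(wt = 3))
```

The slope b is negative here, meaning that heavier cars tend to have lower mileage in this data set.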
Applications of Correlation.
Correlation describes the relationship between two random variables X and Y, and
there are multiple ways of measuring it, such as the Pearson correlation coefficient and
Spearman’s rank correlation coefficient.
Now you might wonder what the applications of correlation are. Here are a few
examples from a wide spectrum of areas.
Ecommerce:
1. Time spent on E-commerce website vs Money spent by a customer
Consider an e-commerce company whose website is accessed by millions of users all over
the world. The company can measure how much time each customer spent on the website in the
last 24 hours and how much money that customer spent in the next 24 hours. Suppose I spent
30 minutes browsing products and bought 20 dollars’ worth of items in the next 24 hours,
while another person who spent 60 minutes ended up buying 100 dollars’ worth of
products, and so on. The company will have this data for millions of customers. If the company
knows the correlation between these two variables (time spent on the website vs. money spent),
it can make decisions about which customers to target, and it can design or modify its website
to encourage people to spend more time on it. This is why an attractive, user-friendly website
is so important for an e-commerce company.
2. Number of unique users vs Sales in a day
Another e-commerce example is the number of unique users in a day versus dollar sales in
that day. Suppose 100K users visited my website on one day and my sales were 1 million
dollars; on another day I had 120K users and my sales were 1.6 million dollars, and so on.
If these two variables are positively correlated, then sales tend to increase as the number
of unique visitors increases. Knowing this, the company can run experiments aimed at
increasing the number of unique visitors in a day, for example new advertisements, emails,
updates to the mobile app, or attractive deals.
Real Estate:
Salary/income of person vs Area of their Home
People with higher salaries tend to buy homes with more square footage. A real estate
brokerage can use this when marketing or showcasing properties. Suppose I run a real
estate brokerage and ask you to fill in a form with fields such as name, address,
phone and salary range. Once data has been collected from many people, I can check whether
a person's salary is correlated with the square footage of their home. If it is, a
real estate broker can spend time showing you homes that match your salary.
Education Ministry:
Number of years of study vs Income
Is the number of years of education/study correlated with income?
Suppose you work for the Education Ministry or the Labor Ministry and need to take
initiatives to develop education in your district or state. Once you have collected survey
data, you would try to understand people's income in relation to their education.
Education can be categorized as primary education, higher secondary, pre-university,
bachelor's degree, master's degree, etc.
If the correlation shows that more years of education go with higher income, you would
encourage people to study for more years, and focus on the corresponding areas of
education accordingly.
These are a few examples of how correlation is important in data analysis across a wide spectrum
of areas: real estate, government, and e-commerce. They are just a few of the many ways in
which correlation is useful.

Different types of Correlation.


Correlation means association - more precisely it is a measure of the extent to which two variables
are related. There are three possible results of a correlational study: a positive correlation, a
negative correlation, and no correlation.
• A positive correlation is a relationship between two variables in which both variables
move in the same direction: when one variable increases the other also increases, and
when one decreases the other also decreases. An example of
positive correlation would be height and weight. Taller people tend to be heavier.
• A negative correlation is a relationship between two variables in which an increase in
one variable is associated with a decrease in the other. An example of negative
correlation would be height above sea level and temperature. As you climb the
mountain (increase in height) it gets colder (decrease in temperature).
• A zero correlation exists when there is no relationship between two variables. For
example there is no relationship between the amount of tea drunk and level of
intelligence.
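The three cases can be sketched with simulated data. The variables below are artificial and were constructed only to show each type of correlation.

```r
set.seed(10)
x <- 1:50
noise <- rnorm(50)                 # random values unrelated to x

r_pos  <- cor(x,  2 * x + noise)   # y rises with x: close to +1
r_neg  <- cor(x, -2 * x + noise)   # y falls as x rises: close to -1
r_zero <- cor(x, noise)            # unrelated: near 0, though rarely exactly 0

c(positive = r_pos, negative = r_neg, none = r_zero)
```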
Applications of Simple Linear Regression
Marks scored by students based on number of hours studied (ideally)- Here marks scored in
exams are dependent and the number of hours studied is independent.
Predicting crop yields based on the amount of rainfall- Yield is a
dependent variable while the measure of precipitation is an independent
variable.
Predicting the Salary of a person based on years of experience- Experience
is the independent variable while Salary is the dependent
variable.
Regression and Correlation.
Correlation and Regression are two important concepts in statistical research, both based
on the distribution of multiple variables (a multivariate distribution). Correlation and
Regression form one of the significant chapters for Class 12 students, and it is important
for students to learn and understand the differences between the two.
Correlation is an analysis that helps us determine whether a relationship exists between
two variables, ‘p’ and ‘q’, and how strong it is.
Regression is also an analysis; it predicts the value of a dependent variable from the
known value of the independent variable.
Difference Between Correlation And Regression
As mentioned earlier, Correlation and Regression are principal units to be studied while
preparing for the 12th Board examinations. It is also important for students to be well
aware of the differences between correlation and regression. Below are a few key
differences between the two.

Correlation | Regression
‘Correlation’, as the name says, determines the interconnection or co-relationship between the variables. | ‘Regression’ explains how an independent variable is numerically associated with the dependent variable.
In Correlation, no distinction is made between the independent and dependent variables. | In Regression, the dependent and independent variables play different roles.
The primary objective of Correlation is to find a quantitative/numerical value expressing the association between the values. | The primary intent of Regression is to estimate the values of a random variable based on the values of the fixed variable.
Correlation states the degree to which the two variables move together. | Regression specifies the effect of a unit change in the known variable (p) on the estimated variable (q).
Correlation helps to establish the connection between the two variables. | Regression helps in estimating a variable’s value based on another given value.

Correlation Coefficient and R Function


Correlation coefficient formulas are used to find how strong a relationship is between two sets of
data. The formulas return a value between -1 and 1, where:

• 1 indicates a perfect positive linear relationship.
• -1 indicates a perfect negative linear relationship.
• A result of zero indicates no linear relationship at all.

The R function to find the correlation coefficient is as follows:

cor(x, y) where x and y are numeric vectors, matrices or data frames.
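A minimal sketch of cor() in use, assuming the built-in mtcars data set (the columns chosen are only for illustration):

```r
# Pearson correlation (the default) between car weight and mileage
r <- cor(mtcars$wt, mtcars$mpg)
r

# Spearman's rank correlation for the same pair of variables
cor(mtcars$wt, mtcars$mpg, method = "spearman")

# With a data frame of several numeric columns, cor() returns a correlation matrix
m <- cor(mtcars[, c("mpg", "wt", "hp")])
m
```

The value of r is strongly negative, which matches the idea that heavier cars tend to travel fewer miles per gallon.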
t-test and Chi square test
Both t-tests and chi-square tests are statistical tests, designed to test, and possibly reject, a
null hypothesis. The null hypothesis is usually a statement that something is zero, or that
something does not exist. For example, you could test the hypothesis that the difference
between two means is zero, or you could test the hypothesis that there is no relationship
between two variables.
Null Hypothesis Tested
A t-test tests a null hypothesis about two means; most often, it tests the hypothesis that
two means are equal, or that the difference between them is zero. For example, we could
test whether boys and girls in fourth grade have the same average height.
A chi-square test tests a null hypothesis about the relationship between two variables. For
example, you could test the hypothesis that men and women are equally likely to vote
"Democratic," "Republican," "Other" or "not at all."
Types of Data
A t-test requires two variables; one must be categorical and have exactly two levels, and
the other must be quantitative and be estimable by a mean. For example, the two groups
could be Republicans and Democrats, and the quantitative variable could be age.
A chi-square test requires categorical variables, usually only two, but each may have any
number of levels. For example, the variables could be ethnic group — White, Black, Asian,
American Indian/Alaskan native, Native Hawaiian/Pacific Islander, other, multiracial; and
presidential choice in 2008 — (Obama, McCain, other, did not vote).
Variations
There are variations of the t-test to cover paired data; for example, husbands and wives, or
right and left eyes. There are variations of the chi-square to deal with ordinal data — that is,
data that has an order, such as "none," "a little," "some," "a lot" — and to deal with more
than two variables.
Conclusions
The t-test allows you to say either "we can reject the null hypothesis of equal means at the
0.05 level" or "we have insufficient evidence to reject the null of equal means at the 0.05
level." A chi-square test allows you to say either "we can reject the null hypothesis of no
relationship at the 0.05 level" or "we have insufficient evidence to reject the null at the
0.05 level."
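Both tests are available in base R as t.test() and chisq.test(). The sketch below assumes the built-in mtcars data set; the variables used (mpg, am, cyl) are chosen only for illustration.

```r
# t-test: one categorical variable with exactly two levels (am) and one
# quantitative variable (mpg). Null hypothesis: the two group means are equal.
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value

# Chi-square test: two categorical variables (cyl and am).
# Null hypothesis: there is no relationship between them.
tab <- table(mtcars$cyl, mtcars$am)
# Some expected counts are small here, so R warns that the approximation
# may be inaccurate; the warning is suppressed to keep the output short.
ct <- suppressWarnings(chisq.test(tab))
ct$p.value
```

A p-value below 0.05 leads to rejecting the null hypothesis at the 0.05 level; otherwise there is insufficient evidence to reject it.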
