R 1st Unit

Uploaded by

shubham mahant
Copyright
© All Rights Reserved

R Vector Explained with Examples

In simple words, the vector in R is a very basic data structure in R Programming. Using vectors
we can create other data structures like Data Frame and Matrix.

1. What is R Vector?

The vector is the simplest basic type of object in R. A vector is a sequence of data elements
of the same type. A vector is similar to a Series object in Python pandas. Members of a vector
are called components. Below are some key points to note and remember about vectors.

 Vector is a basic data structure in R.
 It is a one-dimensional data structure.
 It holds elements of the same type.
 Members in the vector are called components.
 It is not recursive.
 We can easily convert vectors into data frames or matrices.
 In DataFrame, each column is considered a vector.
2. Create Vector

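Let’s create a simple integer vector using c(). Note that the c() function stands for concatenate,
and it is one of the most used ways to create a vector in R. Plain numbers such as 1 are stored
as double, so the L suffix is used here to get integer values.

# Integer vector (without the L suffix, c(1,3,6,7) would be of type double)
v_int = c(1L,3L,6L,7L)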
print(v_int)

# Output
#[1] 1 3 6 7
You can also use c() to create a character vector. Alternatively, the character() function
creates a character vector of a given length filled with empty strings.

# Character vector
v_char = c('One','Two','Three')
print(v_char)

# Output
#[1] "One" "Two" "Three"

Use the length() function to get the length of the vector. Pass the vector as the argument to this
function.

# Length of the vector
length(v_char)

# Output
#[1] 3

You can get the type of the vector using the typeof() function. This takes the vector object as an
argument and returns its type.

# Type of the vector
typeof(v_char)

# Output
#[1] "character"

If you have a character vector, you can easily convert it to double by
using the as.vector() function with a mode value "numeric".

# Syntax of as.vector()
as.vector(x, mode = "any")

Example

# Convert character vector to double
x <- c("1","2")
print(typeof(x))
x <- as.vector(x,"numeric")
print(typeof(x))

# Output
#[1] "character"
#[1] "double"

Use is.vector() to check whether a vector is of a specific type (mode).

# Syntax of is.vector()
is.vector(x, mode = "any")

Example

# Check type of vector
x <- c("1","2")
is.vector(x,'character')
is.vector(x,'numeric')

# Output
[1] TRUE
[1] FALSE

3. Empty Vector

If the length of the vector is zero then it is considered an empty vector in R. It can be created in
several ways in R programming. One way to create an empty vector is by using the c() function
without any arguments.

# Create empty vector
v_empty = c()
print(v_empty)

# Output
NULL

Now let’s get the length of the vector and check if it is empty.

# Length of the vector
length(v_empty)
# Output
#[1] 0

4. Convert Vectors into DataFrame

By using the data.frame() function we can create a DataFrame from vectors. A data frame is a list of
variables with the same number of rows and unique row names, so all vectors you convert
to a DataFrame should have the same length. Vectors with incompatible lengths result in an error.

# Create Vectors
id <- c(10,11,12,13)
name <- c('sai','ram','deepika','sahithi')
dob <- as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))

# Create DataFrame
df <- data.frame(id,name,dob)

In the above example, I have used the vectors id, name and dob as
arguments to the data.frame() function, separated by commas. The example yields the
output below. R creates a data frame whose column names are the names of the vectors
we passed in. You can use df or print(df) to print the DataFrame to the console.

# Output
  id    name        dob
1 10     sai 1990-10-02
2 11     ram 1981-03-24
3 12 deepika 1987-06-14
4 13 sahithi 1985-08-16

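The length requirement above can be checked with a small sketch. The vectors below are shortened on purpose and the name result is introduced only for illustration. Note that data.frame() recycles a shorter vector when the longer length is a whole multiple of it, so the error appears only when the lengths do not divide evenly.

```r
# Two vectors of incompatible lengths (4 and 3)
id <- c(10, 11, 12, 13)
name <- c('sai', 'ram', 'deepika')

# Capture the error message instead of stopping the script
result <- tryCatch(
  data.frame(id, name),
  error = function(e) conditionMessage(e)
)
print(result)
```
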
6. Repeating Vectors

By using the rep() function we can repeat the members of a vector a specified number
of times and create a new vector with the result.

# Repeat vectors
v_rep = rep(c(1, 5), times = 3)
print(v_rep)

# Output
#[1] 1 5 1 5 1 5
Let’s pass a vector to times and see what happens. In this form, times specifies how often each
corresponding value is repeated. You can also use the each argument to specify how many times
to repeat each value.

# Repeat vectors
v_rep = rep(c(1, 5), times = c(2,3))
print(v_rep)

# Output
#[1] 1 1 5 5 5
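
The each argument mentioned above can be sketched as follows. The variable names v_each and v_both are illustrative only.

```r
# Repeat every element twice before moving on to the next one
v_each <- rep(c(1, 5), each = 2)
print(v_each)

# Combine each and times: repeat every element twice,
# then repeat that whole result two times
v_both <- rep(c(1, 5), each = 2, times = 2)
print(v_both)
```

Here each works element by element, while a single number passed to times repeats the whole vector.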

7. Types of Vector in R

You can create a vector with all basic types.

 Integer Vectors – Stores only integer values.
 Numeric Vectors – For all numeric values.
 Character Vectors – Stores alphanumeric characters.
 Logical Vectors – Stores TRUE/FALSE values.
 Datetime Vectors – Which are used to store date time values
 Factor Vectors – To work with categories.
8. Combine Two or Multiple Vectors

As explained above, c() is used to combine members, so let’s use it to combine two
or more vectors into a single vector. By using c() or append() you can also add elements to a vector
in R.

# Create character vectors


v_char = c('One','Two','Three')
v_char2 = c('Four','Five','Six')

# Combine vectors
v_combine = c(v_char,v_char2)
print(v_combine)

# Output
#[1] "One" "Two" "Three" "Four" "Five" "Six"
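
As a small sketch of the append() function mentioned above: it adds elements at the end by default, and its after argument sets the position after which the new element is inserted. The names v_end and v_mid and the value 'OneAndHalf' are made up for this example.

```r
v_char <- c('One', 'Two', 'Three')

# Add an element at the end of the vector
v_end <- append(v_char, 'Four')
print(v_end)

# Insert an element after position 1
v_mid <- append(v_char, 'OneAndHalf', after = 1)
print(v_mid)
```
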

9. Access Vector

By using R bracket notation you can access the members by index. Note that, unlike many other
programming languages, the R index starts from 1. To get a range of values
use startindex:endindex, which returns all values from the start index to the end index, both
included. Let’s access elements by giving an index inside a single square bracket [].

# Access vector elements


v_char[1] # Returns [1] "One"
v_char[1:3] # Returns [1] "One" "Two" "Three"

When you use a negative index, R removes the element at that position and returns the rest.
For example, the following creates a vector slice with the second member removed.

# Access vector elements


v_char2 = c('Four','Five','Six')
v_char2[-2]

# Output
#[1] "Four" "Six"

If an index is used that is not present, NA is returned as a result.

# Access vector elements


v_char2[5]

# Output
#[1] NA

10. Change Values in Vector

By using an index let’s see how to modify or change the value of an index. The following
example changes the value of the first member of the vector to the character string Oneeeeeeee.

# Change vector elements


v_char[1] <- "Oneeeeeeee"
print(v_char)
# Output
[1] "Oneeeeeeee" "Two" "Three"

11. R Vectors vs list

Below are some differences between a vector and a list.

 A vector can hold only elements of the same type, whereas a list can hold elements of
different types such as numeric, character, logical, etc.
 A vector is flat, so each element is a single value, whereas a list can hold whole vectors and other lists as its elements.
 A vector is not recursive, whereas a list is recursive.

# Create list
list_obj = list('One',2,TRUE)
print(list_obj)

# Output
#[[1]]
#[1] "One"

#[[2]]
#[1] 2

#[[3]]
#[1] TRUE

12. Handling NA Values in R Vector

NA stands for ‘Not Available’ and represents a missing value in R. To remove NA values from a
vector, use is.na() or na.omit(); many functions also accept the argument na.rm = TRUE.
R supports several other methods to delete NA from a vector as well.
When you run an arithmetic operation on an R vector that has NA values, you will get NA as the
result. In general, any operation on an NA becomes NA. Hence, it’s best practice to clean up the
vector by removing all NA values before performing an arithmetic operation.

# Quick Examples

# Create Vector
ids <- c(10,5,NA,15,10)

# Using is.na()
x <- ids[!is.na(ids)]

# Using na.rm=TRUE
x <- max(ids, na.rm=TRUE)

# Using na.omit()
x <- na.omit(ids)

# Remove NA from string Vector.


states = c('NY','DC',NA,'CA','NV',NA)
x <- na.omit(states)

# Using exclude
x <- na.exclude(states)
print(x)
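
The way NA spreads through arithmetic, and the effect of removing it first, can be sketched as below. The variable names are illustrative only.

```r
ids <- c(10, 5, NA, 15, 10)

# Any arithmetic that touches NA yields NA
total_with_na <- sum(ids)
print(total_with_na)

# Dropping NA first, or using na.rm = TRUE, gives a usable result
total_clean <- sum(ids[!is.na(ids)])
total_narm <- sum(ids, na.rm = TRUE)
print(total_clean)

# Count how many values are missing
n_missing <- sum(is.na(ids))
print(n_missing)
```
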
13. Sorting Vectors

To sort a vector, use the sort() function, which takes a vector as an argument and
returns the values in alphabetical order for a character vector and in ascending order for a numeric vector.

# Sort Character Vector


v <- c('D','R','G','A','X')
sort(v)
sort(v, decreasing=TRUE)

# Output
#> sort(v)
#[1] "A" "D" "G" "R" "X"
#>sort(v, decreasing=TRUE)
#[1] "X" "R" "G" "D" "A"

14. Subsetting or extracting Vectors

To subset elements from the vector use bracket notation []. With this notation we can subset
a single element, multiple elements, a range, or the elements at positions given by
another vector.

# Create a vector
v <- c('A','B','C','D','E','F')

# Subset by Index Position
v[1]

# Output
#> v[1]
#[1] "A"

# Subset elements by a vector of positions
# (fractional positions are truncated, so 2.3 -> 2 and 4.5 -> 4)
v[c(1,3)]
v[c(2.3,4.5)]

# Output
#> v[c(1,3)]
#[1] "A" "C"
#> v[c(2.3,4.5)]
#[1] "B" "D"
15. Removing Duplicates

Handling duplicate values is a common challenge when analyzing data. Removing duplicates is part
of data cleaning, which needs to be done before performing operations on the data because
duplicate values can produce inconsistent results. For an R vector, the duplicated() function
flags repeated values, so negating it removes duplicates from the vector.

# Remove duplicates
v <- c('A','B','D','C','A','F','G','C','d','E','E')
v[!duplicated(v)]

# Output
#[1] "A" "B" "D" "C" "F" "G" "d" "E"
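
As a sketch of what duplicated() returns, and of the base R function unique() as a shorter route to the same result (v_unique and dup_flags are names chosen for this example):

```r
v <- c('A','B','D','C','A','F','G','C','d','E','E')

# duplicated() marks every value that has already appeared earlier
dup_flags <- duplicated(v)
print(dup_flags)

# unique() keeps the first occurrence of each value
v_unique <- unique(v)
print(v_unique)
```

Both approaches are case-sensitive, which is why "D" and "d" are kept as separate values.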
16. Remove Elements
To remove multiple values from a vector, define a vector with the values you want to
remove and use it with bracket notation [] and the %in% operator. The %in% operator
checks which elements match the values to remove.

# Remove specific value from vector


x <- c('A','B','C','D','E','F','B')
x[ !x == 'A']

# Output
#[1] "B" "C" "D" "E" "F" "B"

# Remove Multiple Values
x <- c('A','B','C','D','E','F','B')
x[! x %in% c('A', 'D', 'E')]

# Output
#[1] "B" "C" "F" "B"

Vector recycling in R
Vector recycling happens when we perform an operation such as addition or subtraction on two
vectors of unequal length. The shorter vector is repeated until the operation completes on
the longer vector. If we perform an addition on vectors of equal length, the first value of vector 1 is
added to the first value of vector 2, and so on.

So, the repetition of the shorter vector until the operation on the longer vector completes is known
as vector recycling. This is a built-in property of vectors in the R language. Let us see the
implementation of vector recycling.
Example 1 :

# creating vector with 1 to 6 values
vec1=1:6
# creating vector with 1 to 2 values
vec2=1:2
# adding vector1 and vector2
print(vec1+vec2)

Output :
[1] 2 4 4 6 6 8

In vector recycling, the length of the longer vector should be a multiple of the length of the shorter
vector. If not, we get a warning that the longer object length is not a multiple of the shorter object length. Here the
longer object length is a multiple of the shorter object length, so we didn’t get a warning message.
Example 2 :

# creating vector with 20 to 25 values
vec1=20:25

# creating vector with 4 to 6 values
vec2=4:6

# adding vector1 and vector2


print(vec1+vec2)

Output :
[1] 24 26 28 27 29 31

Here also the longer object length is a multiple of the shorter object length, so we didn’t get a warning message.
Example 3 :

# creating vector with 10 to 14 values


vec1=10:14

# creating vector with 3 to 5 values


vec2=3:5

# adding vector1 and vector2


print(vec1+vec2)

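Output :
[1] 13 15 17 16 18
Warning message:
In vec1 + vec2 :
  longer object length is not a multiple of shorter object length

Here the longer object length is not a multiple of the shorter object length, so we got a warning message.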
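
What recycling does can be made explicit with rep_len(), which stretches a vector to a given length. The sketch below reuses the vectors of Example 1; vec2_recycled is a name introduced here for illustration.

```r
vec1 <- 1:6
vec2 <- 1:2

# What R does implicitly: repeat the shorter vector up to the longer length
vec2_recycled <- rep_len(vec2, length(vec1))
print(vec2_recycled)

# The explicit version gives the same result as the recycled addition
same_result <- identical(vec1 + vec2, vec1 + vec2_recycled)
print(same_result)
```
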

Explain How To Perform Vector Indexing And Slicing In R.


Indexing is defined as selecting an element from a vector at a particular index. In R, indexing starts
from 1 and uses square brackets [].
Slicing, on the other hand, is defined as selecting a range of elements from a vector using index
notation.
To perform vector indexing and slicing, we first have to create a vector:
v1 <- c(100,200,300)
v2 <- c('a','b','c')
Indexing works by using brackets and passing the index position of the element as a number.
# Grab second element
v1[2]
Output: 200
v2[2]
Output: 'b'
We can perform multiple indexing also.
v1[c(1,2)]
Output: 100 200
To perform slicing we can use a colon (:) to indicate a slice of a vector. The format is:
vector[start_index:stop_index]
v <- c(1,2,3,4,5,6,7,8,9,10)
v[2:4]
Output: 2 3 4

# R program to illustrate
# dim function

# Getting the numbers from 1 to 9
x <- rep(1:9)
x

# Calling the dim() function to
# set dimension of 3 * 3
dim(x) <- c(3, 3)

# Getting the numbers
# in 3 * 3 representation
x

Output:
[1] 1 2 3 4 5 6 7 8 9

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

R Matrix

In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created by passing an
input vector to the matrix() function. On R matrices, we can perform addition, subtraction, multiplication, and
division.
In an R matrix, elements are arranged in a fixed number of rows and columns. The elements are most often
numbers, but they can be of any basic type as long as all the elements share a common basic type.

Example

matrix1<-matrix(c(11, 13, 15, 12, 14, 16),nrow =2, ncol =3, byrow = TRUE)
matrix1

Output

     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   12   14   16

How to create a matrix in R?

Like vector and list, R provides a function which creates a matrix: the matrix() function.
This function plays an important role in data analysis. The syntax of matrix() is as follows:

matrix(data, nrow, ncol, byrow, dimnames)

data

The first argument in the matrix function is data. It is the input vector which holds the data elements of the matrix.

nrow

The second argument is the number of rows which we want to create in the matrix.

ncol

The third argument is the number of columns which we want to create in the matrix.

byrow

The byrow parameter is a logical value. If it is TRUE, the input vector elements are arranged by row;
otherwise (the default) they are arranged by column.

dimnames

The dimnames parameter is a list holding the names assigned to the rows and columns.

Let's see an example to understand how matrix function is used to create a matrix and arrange the elements
sequentially by row or column.

Matrix is a two dimensional data structure in R programming.

Matrix is similar to vector but additionally contains the dimension attribute. All attributes of an object can be
checked with the attributes() function (dimension can be checked directly with the dim() function).
We can check if a variable is a matrix or not with the class() function.
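
A short sketch of the checks described above; the matrix m is an arbitrary example, and is.matrix() is shown as a direct test alongside class().

```r
m <- matrix(1:6, nrow = 2)

# All attributes of the object; for a plain matrix this is just dim
m_attributes <- attributes(m)
print(m_attributes)

# The dimension on its own: number of rows, then number of columns
m_dim <- dim(m)
print(m_dim)

# Two ways to check that m is a matrix
print(class(m))
m_is_matrix <- is.matrix(m)
print(m_is_matrix)
```
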

How to create a matrix in R programming?


A matrix can be created using the matrix() function.
Dimension of the matrix can be defined by passing appropriate values for arguments nrow and ncol.
Providing value for both dimensions is not necessary. If one of the dimensions is provided, the other is inferred
from the length of the data.

matrix(1:9, nrow = 3, ncol = 3)

Output

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Same result is obtained by providing only one dimension. That is:

matrix(1:9, nrow = 3)

Output

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

We can see that the matrix is filled column-wise. This can be reversed to row-wise filling by passing TRUE to
the argument byrow.

# fill matrix row-wise


matrix(1:9, nrow=3, byrow=TRUE)

Output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

In all cases, however, a matrix is stored in column-major order internally as we will see in the subsequent
sections.

It is possible to name the rows and columns of matrix during creation by passing a 2 element list to the
argument dimnames:

x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))


x

Output

  A B C
X 1 4 7
Y 2 5 8
Z 3 6 9

These names can be accessed or changed with two helpful functions colnames() and rownames().

# access column names and rownames


colnames(x)
rownames(x)

# change column names


colnames(x) <- c("C1","C2","C3")
# change row names
rownames(x) <- c("R1","R2","R3")
x

Output

[1] "A" "B" "C"
[1] "X" "Y" "Z"
   C1 C2 C3
R1  1  4  7
R2  2  5  8
R3  3  6  9

Create Matrix Using cbind() and rbind()


Another way of creating a matrix is by using functions cbind() and rbind() as in column bind and row bind.

cbind(c(1,2,3),c(4,5,6))
rbind(c(1,2,3),c(4,5,6))

Output

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Here, the cbind() function is used to combine two vectors, c(1, 2, 3) and c(4, 5, 6), by column.
The resulting matrix will have two columns, where the first column contains the elements of the first vector and
the second column contains the elements of the second vector.

Similarly, the rbind() function is used to combine two vectors, c(1, 2, 3) and c(4, 5, 6), by row.
The resulting matrix will have two rows, where the first row contains the elements of the first vector and the
second row contains the elements of the second vector.

Create Matrix Using dim()


Finally, you can also create a matrix from a vector by setting its dimension using dim().

x <- c(1,2,3,4,5,6)

dim(x) <- c(2,3)


x

Output

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Note: We can check if a variable is a matrix or not with the class() function. For example, in the above matrix,
we can check that as:

x <- c(1,2,3,4,5,6)
class(x)

dim(x) <- c(2,3)


x
class(x)

Output

[1] "numeric"
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
[1] "matrix" "array"

How to access (indexing and slicing) elements of a matrix?


We can access elements of a matrix using the square bracket [] indexing method. Elements can be accessed
as var[row, column]. Here row and column are vectors.
Let's learn various ways to access elements of a matrix.

Using integer vector as index


We specify the row numbers and column numbers as vectors and use it for indexing.

If any field inside the bracket is left blank, it selects all.

We can use negative integers to specify rows or columns to be excluded.

x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE)


# select rows 1 & 2 and columns 2 & 3
x[c(1,2),c(2,3)]

# leaving column field blank will select entire columns


x[c(3,2),]

# leaving row as well as column field blank will select entire matrix
x[,]

# select all rows except first


x[-1,]

Output

     [,1] [,2]
[1,]    2    3
[2,]    5    6
     [,1] [,2] [,3]
[1,]    7    8    9
[2,]    4    5    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9

One thing to notice here is that, if the matrix returned after indexing is a row matrix or column matrix, the result
is given as a vector. In the above example if we do:

x[1,]
class(x[1,])

We get,

Output

[1] 1 2 3
[1] "numeric"

This behavior can be avoided by using the argument drop = FALSE while indexing. If we do:

x[1,,drop=FALSE] # now the result is a 1X3 matrix rather than a vector

class(x[1,,drop=FALSE])

Then we get,

Output

     [,1] [,2] [,3]
[1,]    1    2    3
[1] "matrix" "array"

It is possible to index a matrix with a single vector.

While indexing in such a way, it acts like a vector formed by stacking columns of the matrix one after another.
The result is returned as a vector.
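
A minimal sketch of single-vector indexing, using an arbitrary matrix m filled by row so that the column-stacking order is easy to see:

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# The matrix viewed as one long vector: columns stacked one after another
stacked <- as.vector(m)
print(stacked)

# Positions 2 and 4 of that stacked vector
picked <- m[c(2, 4)]
print(picked)
```

Position 2 is the second entry of the first column and position 4 is the first entry of the second column.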

Using logical vector as index


Two logical vectors can be used to index a matrix. In such a situation, the rows and columns where the value
is TRUE are returned. These indexing vectors are recycled if necessary and can be mixed with integer vectors.

x <- matrix(c(4, 6, 1, 8, 0, 2, 3, 7, 9), nrow = 3, ncol = 3, byrow = TRUE)


x[c(TRUE,FALSE,TRUE),c(TRUE,TRUE,FALSE)]

# the 2 element logical vector is recycled to 3 element vector


x[c(TRUE,FALSE),c(2,3)]

Output

     [,1] [,2]
[1,]    4    6
[2,]    3    7
     [,1] [,2]
[1,]    6    1
[2,]    7    9

It is also possible to index using a single logical vector where recycling takes place if necessary.

x[c(TRUE, FALSE)]

In the above example, the matrix x is treated as a vector formed by stacking columns of the matrix one after
another, i.e., (4,8,3,6,0,7,1,2,9).
The indexing logical vector is also recycled and thus alternating elements (4,3,0,1,9) are selected. This property is utilized
for filtering of matrix elements as shown below.

# select elements greater than 5


x[x>5]
# select even elements
x[x%%2 == 0]

Output

[1] 8 6 7 9
[1] 4 8 6 0 2

Using character vector as index


Indexing with a character vector is possible for matrices with a named row or column. This can be mixed with
integer or logical indexing.

# create a matrix with specified values and column names


x <- matrix(c(4, 6, 1, 8, 0, 2, 3, 7, 9), nrow = 3, ncol = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B",
"C")))

# subset the matrix by selecting the "A" column


x[,"A"]

# subset the matrix by selecting rows that are TRUE and columns "A" and "C"
x[TRUE, c("A", "C")]

# subset the matrix by selecting rows 2 to 3 and columns "A" and "C"
x[2:3, c("A", "C")]

Output

[1] 4 8 3
     A C
[1,] 4 1
[2,] 8 2
[3,] 3 9
     A C
[1,] 8 2
[2,] 3 9

How to modify a matrix in R?
We can combine the assignment operator with the above learned methods for accessing elements of a matrix to
modify it. For example,

x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE)

# modify a single element


x[2,2] <- 10
x
# modify elements less than 5
x[x<5] <- 0
x

Output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4   10    6
[3,]    7    8    9
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0   10    6
[3,]    7    8    9

Transpose a Matrix
A common operation with matrices is to transpose them. This can be done with the function t().

x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE)

t(x)

Output

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

We can add rows or columns using the rbind() and cbind() functions respectively. Similarly, rows or columns can be removed
through reassignment.

x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE)


# add a column
cbind(x, c(1, 2, 3))

# add a row
rbind(x,c(1,2,3))

# remove last row


x <- x[1:2,]; x

Output

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    1
[2,]    4    5    6    2
[3,]    7    8    9    3
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]    1    2    3
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Dimension of the matrix can be modified as well, using the dim() function.

x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)


# change to 3X2 matrix
dim(x) <- c(3,2)
x

Output

     [,1] [,2]
[1,]    1    5
[2,]    4    3
[3,]    2    6

Matrix Arithmetic and operations

In R, we can perform mathematical operations on matrices such as addition, subtraction, multiplication, etc.
These operators work element by element, so both matrices should have the
same dimensions.
Let's see an example to understand how mathematical operations are performed on matrices.

Example 1

R <- matrix(c(5:16), nrow = 4,ncol=3)


S <- matrix(c(1:12), nrow = 4,ncol=3)

#Addition
sum<-R+S
print(sum)

#Subtraction
sub<-R-S
print(sub)

#Multiplication
mul<-R*S
print(mul)

#Multiplication by constant
mul1<-R*12
print(mul1)

#Division
div<-R/S
print(div)

Output

     [,1] [,2] [,3]
[1,]    6   14   22
[2,]    8   16   24
[3,]   10   18   26
[4,]   12   20   28

     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
[3,]    4    4    4
[4,]    4    4    4

     [,1] [,2] [,3]
[1,]    5   45  117
[2,]   12   60  140
[3,]   21   77  165
[4,]   32   96  192

     [,1] [,2] [,3]
[1,]   60  108  156
[2,]   72  120  168
[3,]   84  132  180
[4,]   96  144  192

         [,1]     [,2]     [,3]
[1,] 5.000000 1.800000 1.444444
[2,] 3.000000 1.666667 1.400000
[3,] 2.333333 1.571429 1.363636
[4,] 2.000000 1.500000 1.333333

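The * operator in the example above multiplies element by element. For the matrix product of linear algebra, R provides the %*% operator. A small sketch with two arbitrary 2 x 2 matrices A and B (names chosen for this example):

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)

# Element-wise product: each entry of A times the matching entry of B
elementwise <- A * B
print(elementwise)

# Matrix product: rows of A combined with columns of B
matprod <- A %*% B
print(matprod)
```

For %*% the number of columns of the first matrix must equal the number of rows of the second, while the dimensions do not otherwise have to match.
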
Categorical Data
Creating categories in R is a common task when working with data. Categories are also known as factors or
categorical variables, and they are used to represent discrete groups or levels of a variable. There are different
ways to create categories in R, depending on the type and structure of the data.
One way to create categories in R is to use the factor() function. This function takes a vector of values and
converts it into a factor whose levels are the distinct values. You can also provide labels for each level, or use the
default labels, which are the sorted distinct values. For example, if you have a vector of the letters A, B, C, and D,
you can create a factor with four levels:
# Create a vector of letters (named letter_vector to avoid masking R's built-in letters)
letter_vector <- c("A", "B", "B", "C", "D")

# Create a factor with four levels
factor_letters <- factor(letter_vector)

# View the factor
factor_letters
[1] A B B C D
Levels: A B C D

Introduction to Factors in R

Factors play a crucial role in data analysis. Learn how to create, subset, and compare them.

What is a Factor?
A factor refers to a statistical data type used to store categorical variables. Categorical variables belong to
a limited number of categories. Continuous variables, on the other hand, can correspond to an infinite number
of values.

It is important to distinguish whether you are working with a categorical variable or a continuous variable, as
statistical models treat both types differently.

An example of a categorical variable is sex. In most cases, you can limit the categories to “Male” or “Female”.
While there may be circumstances where you may include additional categories (whether to take into
consideration chromosomal variation, hermaphroditic animals, or different cultural norms), you will always have
a finite number of categories.

To create factors in R, use the factor() function. The factor function is used to encode a vector as a factor (other
terms for factors are ‘category’ and ‘enumerated type’). For example, sex_vector contains the sex of 5 different
individuals:
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

It is clear that there are two categories (factor levels): “Male” and “Female”.

The function factor() will encode the vector as a factor:


factor_sex_vector <- factor(sex_vector)

Here is an example of the factor() function in action.
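
A minimal sketch of factor() applied to sex_vector. The calls to levels() and as.integer() are added here for illustration, to show the levels R found and the integer codes stored behind them.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Encode the character vector as a factor
factor_sex_vector <- factor(sex_vector)
print(factor_sex_vector)

# The distinct categories, sorted alphabetically
sex_levels <- levels(factor_sex_vector)
print(sex_levels)

# Internally each element is stored as the position of its level
sex_codes <- as.integer(factor_sex_vector)
print(sex_codes)
```
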

Types of Categorical Variables

There are two different types of categorical variables:

 A nominal categorical variable, which is a categorical variable without an implied order. It is
impossible to say that ‘one is worth more than the other.’ For example, consider the
categories "Elephant", "Giraffe", "Donkey", and "Horse". It is impossible to say that one stands
above or below the other.

 An ordinal categorical variable, which has a natural ordering. Consider the
categories "Low", "Medium", and "High". It is obvious that "Medium" ranks higher than "Low",
and "High" ranks above "Medium".

R constructs and prints nominal and ordinal variables differently. For nominal variables, there is no ordering, so
R prints them without indicating any ordering.
# Animals
[1] Elephant Giraffe Donkey Horse
Levels: Donkey Elephant Giraffe Horse

For ordinal variables, R indicates order using < .


# Temperature
[1] High Low High Low Medium
Levels: Low < Medium < High
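
The two printouts above can be reproduced with a sketch like the following. The vector names are assumptions made for this example, since the code that built them is not shown.

```r
# Nominal variable: no order among the levels
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
print(factor_animals_vector)

# Ordinal variable: levels listed from lowest to highest
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
print(factor_temperature_vector)
```
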

Factor Levels
Sometimes, it can be helpful to change the names of specific factor levels in a data set for clarity or other
reasons. You can do this using the levels() function:
levels(factor_vector) <- c("name1", "name2", ...)

For example, suppose the raw data from a survey contains a question regarding the sex of the respondent and
just two categories were recorded: "M" and "F".
survey_vector <- c("M", "F", "F", "M", "M")
Recording the sex with abbreviations can be convenient while collecting data (especially with pen and paper),
but it can become confusing once it is time to analyze the data. At that point, you may want to change the factor
levels to "Male" and "Female" instead of "M" and "F" for clarity.

Caution: the order in which you assign the levels is important. levels(factor_survey_vector) outputs [1] "F"
"M". If you don’t specify the levels of the factor when you are creating it, R will automatically assign
them alphabetically. To correctly map "F" to "Female" and "M" to "Male", the levels should be set
to c("Female", "Male"), in this order.

Instead of the “F” and “M” default levels from the data, we’ve renamed them to “Female” and “Male”.
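
The renaming described above can be sketched as follows, using the survey_vector from the text:

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

# Levels are assigned alphabetically: "F" first, then "M"
original_levels <- levels(factor_survey_vector)
print(original_levels)

# Rename in that same order so that "F" maps to "Female" and "M" to "Male"
levels(factor_survey_vector) <- c("Female", "Male")
print(factor_survey_vector)
```
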

Summarizing a Factor

The summary() function gives a quick overview of the contents of a variable:


summary(my_var)

Suppose we want to determine how many responses of each factor level we have in our vector. We can use
the summary() function to determine this.
summary(factor_survey_vector) is clearly more useful in this case in comparison to summary(survey_vector)
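
A sketch of the two summaries side by side. The setup lines repeat the survey example so that the block runs on its own.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# For a character vector, summary() only reports length, class and mode
char_summary <- summary(survey_vector)
print(char_summary)

# For a factor, summary() counts the responses per level
factor_summary <- summary(factor_survey_vector)
print(factor_summary)
```
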

Factor Comparison
What happens when you try to compare elements of a factor? In factor_survey_vector you have a factor with two
levels: "Male" and "Female" . But how are these valued relative to each other?

Let’s test if male is greater than ( > ) female .


By default, R returns NA when you try to compare values in a factor, since the idea doesn’t make sense.
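
A sketch of that test; the names male, female and comparison are chosen for this example. Running male > female directly also prints a warning that > is not meaningful for factors, so the call is wrapped in suppressWarnings() here to keep the output clean.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# First respondent is "Male", second is "Female"
male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Comparing values of an unordered factor gives NA
comparison <- suppressWarnings(male > female)
print(comparison)
```
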

Ordered Factors

For unordered (or nominal) factor levels, such as "Male" and "Female" , comparison operators, such as the
greater than operator, are not meaningful and R returns a warning message to reflect that.

R attaches an equal value to the levels of a factor by default. However, this does not necessarily have to be the
case. Sometimes you will also deal with factors that have a natural ordering between their categories. Such factors
are called ordinal factors. In this circumstance, we must pass this information to R.

Suppose you want to evaluate the performance of five analysts. To do this, you track their speed as
either "slow", "medium", or "fast". You get the following results:

 Analyst 1 is medium,
 Analyst 2 is slow,
 Analyst 3 is slow,
 Analyst 4 is medium, and
 Analyst 5 is fast.

Our first step is to construct a vector with five entries and assign the speed factor to each analyst.
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

The speed_vector should be converted to an ordinal factor since its categories have a natural ordering. That
is, "fast" is greater than "medium" , which is then greater than "slow" .

By default, the factor() function transforms the speed_vector into an unordered factor. To create an ordered
factor, two additional arguments are needed: ordered and levels .
factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", ...))

By setting the argument ordered to TRUE , you indicate that the factor is ordered. With the argument levels you
give the values of the factor in the correct order.

For you to try


Create an ordered factor from the speed vector. The correct ordering of the factor is c("slow", "medium",
"fast").

Solution
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered=TRUE, levels=c("slow", "medium", "fast"))
# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)
Comparing Ordered Factors
You can compare different elements of an ordered factor by using the well-known operators. For example, to
check whether the first element of the first factor vector is greater than the first element of the second factor vector,
you would use the greater than operator ( > ).
factor_vector1[1] > factor_vector2[1]

For you to try


Suppose ‘data analyst number two’ complains that ‘data analyst number five’ is slowing down the entire project.
Check if their statement is true.

Solution
# Create factor_speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]
# Is data analyst 2 faster than data analyst 5?
da2 > da5
The result is FALSE, which tells us that data analyst two is complaining about data analyst five while they are
in fact the one slowing everything down.
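
The comparison above checks a single pair of analysts. As a minimal sketch, the same ordered factor can also be compared element-wise against a level name or summarised with max(). The variable names slower_than_medium and fastest are illustrative only and do not come from the exercise.

```r
# Rebuild the ordered factor from the exercise
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Comparison operators work element-wise against a level name
slower_than_medium <- factor_speed_vector < "medium"
print(slower_than_medium)

# For ordered factors, max() returns the highest level that is present
fastest <- max(factor_speed_vector)
print(fastest)

# Analyst 2 versus analyst 5, as in the solution above
print(factor_speed_vector[2] > factor_speed_vector[5])
```

These comparisons are only defined for ordered factors; on an unordered factor R returns NA with a warning.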

R Lists
A list in R is a generic vector: unlike an atomic vector, its elements can be of different types.
This makes it a flexible data structure that can hold numbers, characters, vectors, matrices, and even other
lists.

How to create a list in R programming?


We can create a list using the list() function.

x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)


x

Output

$a
[1] 2.5
$b
[1] TRUE
$c
[1] 1 2 3

Here, we create a list x of three components whose data types are double, logical and integer vector, respectively.
The structure of the above list can be examined with the str() function.

str(x)

Output

List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3

In this example, a, b and c are called tags, which make it easier to reference the components of the list.
However, tags are optional. We can create the same list without the tags as follows. In such a scenario, numeric
indices are used by default.

x <- list(2.5,TRUE,1:3)
x

Output

[[1]]
[1] 2.5
[[2]]
[1] TRUE
[[3]]
[1] 1 2 3

How to access components of a list?


Lists can be accessed in similar fashion to vectors. Integer, logical or character vectors can be used for
indexing. Let us consider a list as follows.

x <- list(name = "John", age = 19, speaks = c("English", "French"))

# access elements by name


x$name
x$age
x$speaks

# access elements by integer index


x[c(1, 2)]
x[-2]

# access elements by logical index


x[c(TRUE, FALSE, FALSE)]

# access elements by character index


x[c("age", "speaks")]

Output

[1] "John"
[1] 19
[1] "English" "French"
$name
[1] "John"

$age
[1] 19

$name
[1] "John"

$speaks
[1] "English" "French"

$name
[1] "John"

$age
[1] 19

$speaks
[1] "English" "French"

Indexing with [ as shown above gives us a sublist, not the content inside the component. To retrieve the
content, we need to use [[.
However, [[ can access only a single component at a time.

x <- list(name = "John", age = 19, speaks = c("English", "French"))

# access element by name using single bracket []


x["age"]

# check the type of the result (single bracket returns a list)


typeof(x["age"])
# access element by name using double bracket [[]]
x[["age"]]

# check the type of the result (double bracket returns the content)
typeof(x[["age"]])

Output

$age
[1] 19

[1] "list"
[1] 19
[1] "double"

An alternative to [[ that is often used to access the content of a list is the $ operator. The two behave the
same, except that $ also accepts a partial tag, provided it matches exactly one tag.

x <- list(name = "John", age = 19, speaks = c("English", "French"))

# access element by exact matching using $


x$name

# access element by partial matching using $ ("a" matches "age")

x$a

# access element by partial matching using $ ("s" matches "speaks")

x$s

# create a list with single-letter tags

y <- list(n = "Alice", a = 25, s = c("Spanish", "Italian"))

# access element by exact matching using $

y$n

# access element by exact matching using $

y$a

# access element by exact matching using $

y$s

Output

[1] "John"
[1] 19
[1] "English" "French"
[1] "Alice"
[1] 25
[1] "Spanish" "Italian"
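
A minimal sketch of how partial matching differs between $ and [[. The list z and its tags are made up for illustration. The exact argument of [[ is part of base R.

```r
x <- list(name = "John", age = 19, speaks = c("English", "French"))

# $ partially matches a unique prefix of a tag ("na" matches "name")
print(x$na)

# [[ ]] matches exactly by default, so a prefix finds nothing
print(x[["na"]])

# [[ ]] can be asked to allow partial matching
print(x[["na", exact = FALSE]])

# An ambiguous prefix matches nothing: both tags start with "s"
z <- list(speaks = "English", spouse = "none")
print(z$s)
```

Because an unmatched or ambiguous tag silently returns NULL, spelling tags out in full is usually the safer habit in scripts.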

How to modify a list in R?


We can change components of a list through reassignment. We can choose any of the component accessing
techniques discussed above to modify it.

Notice below that the modified component keeps its position; reassignment does not reorder the list.

x <- list(name = "John", age = 19, speaks = c("English", "French"))

# access element by double brackets [[]] and update its value


x[["name"]] <- "Clair"

# print the updated list


x

Output

$name
[1] "Clair"

$age
[1] 19

$speaks
[1] "English" "French"

How to add components to a list?


Adding new components is easy. We simply assign a value to a new tag, and the component is appended to the end of the list.

x <- list(name = "Clair", age = 19, speaks = c("English", "French"))

# assign a new element to the list using double brackets [[]]


x[["married"]] <- FALSE

# print the updated list


x

Output

$name
[1] "Clair"

$age
[1] 19

$speaks
[1] "English" "French"

$married
[1] FALSE
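
Besides assigning to a new tag, components can be appended by position or several at once. The following is a small sketch; the city value is made up for illustration.

```r
x <- list(name = "Clair", age = 19)

# Append an unnamed component at the next free position
x[[length(x) + 1]] <- TRUE

# Append several named components at once with c()
x <- c(x, list(city = "Paris", speaks = c("English", "French")))

# Inspect the result: five components, the third one has an empty tag
str(x)
```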
How to delete components from a list?
We can delete a component by assigning NULL to it.

x <- list(name = "Clair", age = 19, speaks = c("English", "French"))

# remove an element from the list using double brackets [[]]


x[["age"]] <- NULL

# print the structure of the updated list


str(x)

# remove an element from the list using $ notation
# (this x has no "married" component, so the list stays unchanged)


x$married <- NULL

# print the structure of the updated list


str(x)

Output

List of 2
 $ name  : chr "Clair"
 $ speaks: chr [1:2] "English" "French"
List of 2
 $ name  : chr "Clair"
 $ speaks: chr [1:2] "English" "French"

R - Data Frames

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one
variable and each row contains one set of values from each column.

Following are the characteristics of a data frame.

 The column names should be non-empty.


 The row names should be unique.
 The data stored in a data frame can be of numeric, factor or character type.
 Each column should contain same number of data items.
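
The characteristics above can be checked directly on a small data frame. This is a sketch with made-up values; the column names id and score are illustrative.

```r
# A small data frame with illustrative values
df <- data.frame(id = 1:3, score = c(9.5, 7.0, 8.25))

# Column names are non-empty and row names are unique
print(names(df))
print(rownames(df))

# Every column has the same number of items as there are rows
print(sapply(df, length))

# Columns of clashing lengths are rejected with an error
result <- tryCatch(
  data.frame(id = 1:3, score = c(9.5, 7.0)),
  error = function(e) "error: columns differ in length"
)
print(result)
```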

Create Data Frame

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

When we execute the above code, it produces the following result −

emp_id emp_name salary start_date


1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27

Get the Structure of the Data Frame

The structure of the data frame can be seen by using str() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

When we execute the above code, it produces the following result −

'data.frame': 5 obs. of 4 variables:


$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...

Summary of Data in Data Frame

The statistical summary and nature of the data can be obtained by applying summary() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))

When we execute the above code, it produces the following result −

emp_id emp_name salary start_date


Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
Median :3 Mode :character Median :623.3 Median :2014-05-11
Mean :3 Mean :664.4 Mean :2014-01-14
3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
Max. :5 Max. :843.2 Max. :2015-03-27

Extract Data from Data Frame

Extract specific column from a data frame using column name.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)

When we execute the above code, it produces the following result −

emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25

Extract the first two rows and then all columns

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
When we execute the above code, it produces the following result −

emp_id emp_name salary start_date


1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23

Extract 3rd and 5th row with 2nd and 4th column

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)

When we execute the above code, it produces the following result −

emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27

Expand Data Frame

A data frame can be expanded by adding columns and rows.

Add Column

Just add the column vector using a new column name.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Add the "dept" column.


emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)

When we execute the above code, it produces the following result −


emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance

Add Row

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same
structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing data frame to create
the final data frame.

# Create the first data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.

emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

When we execute the above code, it produces the following result −

emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance
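
The requirement that new rows have "the same structure" can be illustrated with a short sketch. The data frames a, b and c_bad are made-up examples: rbind() appends rows when the column names match and stops with an error when they do not.

```r
a <- data.frame(emp_id = 1:2, dept = c("IT", "HR"))
b <- data.frame(emp_id = 3L, dept = "Finance")
c_bad <- data.frame(emp_id = 4L, department = "IT")  # different column name

# Matching column names: rows are appended
combined <- rbind(a, b)
print(combined)

# Mismatched column names: rbind() stops with an error
outcome <- tryCatch(rbind(a, c_bad), error = function(e) "names do not match")
print(outcome)
```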
R - Packages

R packages are collections of R functions, compiled code and sample data. They are stored under a directory
called "library" in the R environment. By default, R installs a set of packages during installation. More
packages can be added later, when they are needed for some specific purpose. When we start the R console, only
the default packages are available. Other packages that are already installed have to be loaded
explicitly before an R program can use them.

All the packages available in R language are listed at R Packages.

Below is a list of commands to be used to check, verify and use the R packages.

Check Available R Packages

Get library locations containing R packages

.libPaths()

When we execute the above code, it produces the following result. It may vary depending on the local settings
of your pc.

[2] "C:/Program Files/R/R-3.2.2/library"

Get the list of all the packages installed

library()

When we execute the above code, it produces the following result. It may vary depending on the local settings
of your pc.

Packages in library ‘C:/Program Files/R/R-3.2.2/library’:

base The R Base Package


boot Bootstrap Functions (Originally by Angelo Canty
for S)
class Functions for Classification
cluster "Finding Groups in Data": Cluster Analysis
Extended Rousseeuw et al.
codetools Code Analysis Tools for R
compiler The R Compiler Package
datasets The R Datasets Package
foreign Read Data Stored by 'Minitab', 'S', 'SAS',
'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics The R Graphics Package
grDevices The R Graphics Devices and Support for Colours
and Fonts
grid The Grid Graphics Package
KernSmooth Functions for Kernel Smoothing Supporting Wand
& Jones (1995)
lattice Trellis Graphics for R
MASS Support Functions and Datasets for Venables and
Ripley's MASS
Matrix Sparse and Dense Matrix Classes and Methods
methods Formal Methods and Classes
mgcv Mixed GAM Computation Vehicle with GCV/AIC/REML
Smoothness Estimation
nlme Linear and Nonlinear Mixed Effects Models
nnet Feed-Forward Neural Networks and Multinomial
Log-Linear Models
parallel Support for Parallel computation in R
rpart Recursive Partitioning and Regression Trees
spatial Functions for Kriging and Point Pattern
Analysis
splines Regression Spline Functions and Classes
stats The R Stats Package
stats4 Statistical Functions using S4 Classes
survival Survival Analysis
tcltk Tcl/Tk Interface
tools Tools for Package Development
utils The R Utils Package

Get all packages currently loaded in the R environment

search()

When we execute the above code, it produces the following result. It may vary depending on the local settings
of your pc.

[1] ".GlobalEnv" "package:stats" "package:graphics"


[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"

Install a New Package

There are two ways to add new R packages. One is installing directly from the CRAN directory and another is
downloading the package to your local system and installing it manually.

Install directly from CRAN

The following command fetches a package directly from CRAN and installs it in the R environment. You may be
prompted to choose a mirror. Choose the one nearest to your location.

install.packages("Package Name")

# Install the package named "XML".


install.packages("XML")

Install package manually

Go to the link R Packages to download the package needed. Save the package as a .zip file in a suitable location
in the local system.

Now you can run the following command to install this package in the R environment.

install.packages(file_name_with_path, repos = NULL, type = "source")

# Install the package named "XML"


install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")
Load Package to Library

Before a package can be used in code, it must be loaded into the current R session. This applies even to a
package that was installed earlier, because installed packages are not attached automatically.

A package is loaded using the following command −

library("package Name", lib.loc = "path to library")

# Load the package named "XML"


library("XML")
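
Loading fails with an error when a package is not installed. A minimal sketch of a guard around library() follows; load_if_available is a hypothetical helper name, while requireNamespace() and library() are base R functions.

```r
# Hypothetical helper: attach a package only when it is installed
load_if_available <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
    TRUE
  } else {
    message("Package '", pkg, "' is not installed; install it first.")
    FALSE
  }
}

# stats ships with R, so this call should succeed
ok <- load_if_available("stats")
print(ok)
```

The character.only = TRUE argument is needed because the package name is passed in a variable rather than typed literally.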

What is Tidyverse?

Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella
help us work with data throughout an analysis: importing, subsetting, transforming, visualizing, and
so on.

Tidyverse packages

Installation and use


 Install all the packages in the tidyverse by running install.packages("tidyverse").
 Run library(tidyverse) to load the core tidyverse and make it available in your current R session.
Learn more about the tidyverse package at https://tidyverse.tidyverse.org.
Core tidyverse

The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of tidyverse
1.3.0, the following packages are included in the core tidyverse:

ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the
data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the
details.

dplyr
dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common
data manipulation challenges.

tidyr
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief,
every variable goes in a column, and every column is a variable.

readr
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly
parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for
working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for
loops with code that is easier to write and more expressive.

tibble
tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing
out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you
to confront problems earlier, typically leading to cleaner, more expressive code.

stringr
stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built
on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string
manipulations.

forcats
forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle
categorical variables, variables that have a fixed and known set of possible values.
The tidyverse also includes many other packages with more specialised usage. They are not loaded
automatically with library(tidyverse), so you’ll need to load each one with its own call to library().

How To Import Data from a File in R Programming



Data is a collection of facts, and it can come in different forms. To analyze data with the R programming
language, it must first be imported into R from a file, which can be in formats such as txt, CSV, or any other
delimiter-separated file. After importing the data, you can manipulate, analyze, and report on it.
Import Data from a File in R Programming Language
In this article, we are going to see how to import different files in R programming Language.
Import CSV file into R

Method 1: Using read.csv() methods.

Here we will import csv file using read.csv() method in R.


Syntax: read.csv(path, header = TRUE, sep = ",")
Arguments :
 path : The path of the file to be imported
 header : By default : TRUE . Indicator of whether to import column headings.
 sep = "," : The separator for the values in each row.

# specifying the path


path <- "/data.csv"

# reading contents of csv file


content <- read.csv(path)

# contents of the csv file


print (content)

Output:
ID Name Post Age
1 5 H CA 67
2 6 K SDE 39
3 7 Z Admin 28
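
The example above depends on a file that may not exist on your machine. As a self-contained sketch, the following writes a small CSV to a temporary file and reads it back. The file contents simply mirror the output shown above, and the temporary path is an assumption made so that the code runs anywhere.

```r
# Write a small CSV to a temporary file (contents mirror the output above)
path <- tempfile(fileext = ".csv")
writeLines(c("ID,Name,Post,Age",
             "5,H,CA,67",
             "6,K,SDE,39",
             "7,Z,Admin,28"), path)

# header = TRUE and sep = "," are the defaults of read.csv()
content <- read.csv(path)
print(content)

# Check the column types that read.csv() inferred
str(content)

# Remove the temporary file
unlink(path)
```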
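Method 2: Using read.table() methods.
Here we will use the read.table() method to import a CSV file into R. Because read.table() splits on
whitespace by default, the comma separator must be given explicitly. The sep argument has to be a single
character, so ", " (comma plus space) is not valid.

# simple R program to read csv file using read.table()

x <- read.table("D://Data//myfile.csv", header = TRUE, sep = ",", strip.white = TRUE)

# print x
print(x)

Output (assuming the file holds the comma-separated columns Col1, Col2 and Col3):
Col1 Col2 Col3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3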
Importing Data from a Text File
We can easily import or read .txt file using basic R function read.table(). read.table() is used to read a file in
table format. This function is easy to use and flexible.
Syntax:
# read data stored in .txt file
x <- read.table("file_name.txt", header = TRUE)   # or header = FALSE

# Simple R program to read txt file


x<-read.table("D://Data//myfile.txt", header=FALSE)

# print x
print(x)

Output:
V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
If the header argument is set to TRUE, the first line of the file is read as the column names.
Importing Data from a delimited file
R has a function read.delim() to read delimited files into a data frame. By default the fields are separated by a
tab, represented by sep = "\t"; the separator can be changed to a comma (,), a dollar symbol ($), and so on.
Syntax: read.delim("file_name.txt", sep = "\t", header = TRUE)

x <- read.delim("D://Data//myfile.csv", sep="|", header=TRUE)

# print x
print(x)

# print type of x
typeof(x)

Output:
X.V1.V2.V3
1 1, 100, a1, b1
2 2, 200, a2, b2
3 3, 300, a3, b3

[1] "list"
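
The separator passed to read.delim() has to match the one used in the file. A self-contained sketch follows, using a temporary pipe-delimited file whose contents are made up for illustration.

```r
# Write a small pipe-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("V1|V2|V3",
             "100|a1|b1",
             "200|a2|b2"), path)

# sep must name the delimiter actually used in the file
x <- read.delim(path, sep = "|", header = TRUE)
print(x)

# The result is a data frame; its underlying storage type is a list
print(class(x))    # "data.frame"
print(typeof(x))   # "list"

# Remove the temporary file
unlink(path)
```

When the separator does not match the file, each line is read as a single column, which is what the output above shows.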
Importing JSON file in R
Here we are going to use the rjson package to import a JSON file into R.

# Read a JSON file

# Load the package required to read JSON files.


library("rjson")

# Give the input file name to the function.


res <- fromJSON(file = "E:\\exp.json")

# Print the result.


print(res)

Output:
$ID
[1] "1" "2" "3" "4" "5"

$Name
[1] "Mithuna" "Tanushree" "Parnasha" "Arjun" "Pankaj"
$Salary
[1] "722.5" "815.2" "1611" "2829" "843.25"
Importing XML file in R
To import an XML file, we are going to use the XML package in R.
XML file for demonstration:

<RECORDS>
<STUDENT>
<ID>1</ID>
<NAME>Alia</NAME>

<MARKS>620</MARKS>
<BRANCH>IT</BRANCH>
</STUDENT>
<STUDENT>
<ID>2</ID>
<NAME>Brijesh</NAME>

<MARKS>440</MARKS>
<BRANCH>Commerce</BRANCH>
</STUDENT>
<STUDENT>
<ID>3</ID>
<NAME>Yash</NAME>

<MARKS>600</MARKS>
<BRANCH>Humanities</BRANCH>
</STUDENT>
<STUDENT>
<ID>4</ID>
<NAME>Mallika</NAME>
<MARKS>660</MARKS>

<BRANCH>IT</BRANCH>
</STUDENT>
<STUDENT>
<ID>5</ID>
<NAME>Zayn</NAME>
<MARKS>560</MARKS>
<BRANCH>IT</BRANCH>
</STUDENT>
</RECORDS>

Reading XML file:


The file can be read by installing the XML package and then parsing the file with the xmlParse() function.

# loading the library and other important packages


library("XML")

library("methods")

# the contents of sample.xml are parsed


data <- xmlParse(file = "sample.xml")

print(data)

Output:
1
Alia
620
IT
2
Brijesh
440
Commerce
3
Yash
600
Humanities
4
Mallika
660
IT
5
Zayn
560
IT
Reading the CSV file into Data frames in R

What is a CSV file?

CSV stands for comma-separated values. In a CSV file, each line is a record and the values within it are
separated by commas. Because it is plain text, the format is easy to create and to read.

Why CSV is the most used file format for data storing?

Storing data in an Excel sheet is common practice in many companies. Many firms also store data as
comma-separated values (CSV), as the process is easier than creating normal
spreadsheets. Later they can use R's built-in functions to read and analyze the data.

Being the most popular and powerful statistical analysis programming language, R offers specific functions to
read data into organized data frames from a CSV file.

Reading CSV File to Data Frame

In this short example, we will see how we can read a CSV file into organized data frames.

The first step in this process is to get and set up the working directory. You need to choose the
directory that contains the CSV file.

1. Setting up the working directory

Here you can check the default working directory using getwd() function and you can also change the directory
using the function setwd().

>getwd() #Shows the default working directory

----> "C:/Users/Dell/Documents"

> setwd("C:/Users/Dell/Documents/R-test data") #to set the new working directory (use forward slashes in the path)

> getwd() #you can see the updated working directory

---> "C:/Users/Dell/Documents/R-test data"


2. Importing and Reading the dataset / CSV file

After the setting of the working path, you need to import the data set or a CSV file as shown below.
> readfile <- read.csv("testdata.txt")

Execute the above line of code in R studio to get the data frame as shown below.

To check the class of the variable ‘readfile’, execute the below code.

> class(readfile)

---> "data.frame"

In the above image you can see the data frame which includes the information of student names, their ID’s,
departments, gender and marks.

3. Extracting the student’s information from the CSV file

After getting the data frame, you can now analyse the data. You can extract particular information from the data
frame.

To extract the highest marks scored by students,

> marks <- max(readfile$Marks.Scored) #this will give you the highest marks

#To extract the details of a student who scored the highest marks,

> data <- read.csv("traindata.csv")

> Marks <- max(data$Marks.Scored)

> retval <- subset(data, Marks.Scored == max(Marks.Scored)) #This will extract the details of the student who secured highest marks

> View(retval)
To extract the details of the students who are studying in the ‘chemistry’ department,

> data <- read.csv("traindata.csv")

> retval <- subset(data, Department == "chemistry") # This will extract the details of the students in the chemistry department

> View(retval)
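
The extraction steps can be tried without the CSV file. In this sketch an inline data frame with made-up records stands in for the file; only the column names Marks.Scored and Department are taken from the commands above, and Name is assumed.

```r
# Made-up records standing in for the CSV file
students <- data.frame(
  Name = c("Asha", "Ben", "Chen", "Dia"),
  Department = c("chemistry", "physics", "chemistry", "maths"),
  Marks.Scored = c(78, 91, 85, 91),
  stringsAsFactors = FALSE
)

# Highest mark
top_mark <- max(students$Marks.Scored)

# All students who reached it (ties are kept)
top_students <- subset(students, Marks.Scored == top_mark)
print(top_students)

# Students of one department, selected columns only
chem <- subset(students, Department == "chemistry",
               select = c(Name, Marks.Scored))
print(chem)
```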
Indexing and Slicing Data Frames in R
In this article let’s discuss indexing and slicing the Data Frames in R Programming Language.
What is Indexing?
The process of accessing particular data components or subsets within a vector, matrix, or data frame is called
indexing. It enables us to pick, remove, or change particular values or parts of our data based on criteria.
Indexing the Data Frame in R
By indexing a data frame we can retrieve the data of a particular column. Indexing can be done by specifying
column names in square brackets. The syntax for indexing the data frame is-
dataframeName["columnName"]
Example: In this example let’s create a Data Frame “stats” that contains runs scored and wickets taken by a
player and perform indexing on the data frame to extract runs scored by players.

# create a data frame


stats <- data.frame(player=c('A', 'B', 'C', 'D'),
runs=c(100, 200, 408, NA),

wickets=c(17, 20, NA, 5))

print("stats Dataframe")
stats

# fetch data in certain column

stats["runs"]
Output:
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D NA 5
runs
1 100
2 200
3 408
4 NA
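
Single brackets, double brackets and $ all select a column but return different kinds of objects. A short sketch using the same stats data frame follows; the variable names one_col_df, runs_vec and runs_vec2 are illustrative.

```r
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, NA),
                    wickets = c(17, 20, NA, 5))

one_col_df <- stats["runs"]     # single brackets keep a data frame
runs_vec   <- stats[["runs"]]   # double brackets return the column vector
runs_vec2  <- stats$runs        # $ returns the same vector

print(class(one_col_df))
print(class(runs_vec))
print(identical(runs_vec, runs_vec2))
```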

Slicing In R
The process of extracting particular subsets or sections of a vector, matrix, or data frame depending on
predetermined criteria is known as slicing. Using start and end indices, we can choose sequential ranges of
elements or subsets of data.
Slicing the Data Frame
Slicing the Data Frame gives the required rows and columns. This can be done by three ways. They are listed
below-
 Slicing with [ , ]
 Slicing with logical vectors.
 Slicing with subset().
Slicing with [ , ]
Slicing the data frame with [ , ] returns data of specified rows and columns. The syntax of this is mentioned
below-
dataframeName[ fromRow : toRow , columnNumber]
Example:
In the below code we performed slicing on the data frame to fetch specified rows and columns.

# create a data frame


stats <- data.frame(player=c('A', 'B', 'C', 'D'),

runs=c(100, 200, 408, NA),


wickets=c(17, 20, NA, 5))

print("stats Dataframe")
stats

# fetch 2,3 rows and 1,2 columns


stats[2:3,c(1,2)]

# fetch 1:3 rows of 1st column


cat("players - ")
stats[1:3,1]

Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D NA 5
player runs
2 B 200
3 C 408
players - [1] "A" "B" "C"

Slicing with logical vectors


We can perform slicing on the data by specifying the logical conditions. It is used to get the filtered data.
Example:
In this example, we fetch the records of players who scored more than 100 runs.

# create a data frame

stats <- data.frame(player=c('A', 'B', 'C', 'D'),


runs=c(100, 200, 408, 23),
wickets=c(17, 20, 3, 5))

print("stats Dataframe")
stats

# fetch player details who scores


# more than 100 runs
batsmens<-stats[stats$runs>100,]
batsmens

Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 3
4 D 23 5
player runs wickets
2 B 200 20
3 C 408 3
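
Logical slicing behaves differently when the column contains NA, as in the earlier version of the stats data frame. A minimal sketch of the effect and one common workaround with which():

```r
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, NA),
                    wickets = c(17, 20, NA, 5))

# An NA in the condition yields an all-NA row in the result
with_na <- stats[stats$runs > 100, ]
print(with_na)

# which() keeps only positions where the condition is TRUE
without_na <- stats[which(stats$runs > 100), ]
print(without_na)
```

The subset() function drops rows whose condition is NA in the same way as the which() form.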

Slicing with subset()


We can slice data frames using the subset() method. The subset method accepts the data, filter logic to slice and
the columns to fetch. The syntax of slicing with subset is –
subset( x = dataframe, subset = filter_logic, select=c(columnNames))
Example:
In the below code, we fetched the players who picked more than 5 wickets from the data frame stats by slicing
the data frame using the subset method.

# create a data frame


stats <- data.frame(player=c('A', 'B', 'C', 'D'),
runs=c(100, 200, 408, 23),
wickets=c(17, 20, 3, 5))

print("stats Dataframe")

stats

# fetch player details who pick


# more than 5 wickets
subset(x=stats, subset=wickets>5, select=c(player,wickets))

Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 3
4 D 23 5
player wickets
1 A 17
2 B 20

DataFrame Operations in R

Data frames are generic data objects in R that are used to store tabular data. They are among the most popular
data objects in R programming because data is easier to analyze in tabular form. A data frame can also be
thought of as a matrix in which each column can hold a different data type. A data frame is made up of three
principal components: the data, the rows, and the columns.

Operations that can be performed on a DataFrame are:

 Creating a DataFrame
 Accessing rows and columns
 Selecting the subset of the data frame
 Editing dataframes
 Adding extra rows and columns to the data frame
 Add new variables to dataframe based on existing ones
 Delete rows and columns in a data frame
Creating a DataFrame
In the real world, a data frame is usually created by loading a dataset from existing storage, such as an
SQL database, a CSV file, or an Excel file. A data frame can also be created from vectors in R. Following are
some of the ways to create a data frame:
Creating a data frame using Vectors: To create a data frame we use the data.frame() function in R, passing
each of the vectors we have created as arguments to the function.
Example:

# R program to illustrate dataframe

# A vector which is a character vector

Name = c("Amiya", "Raj", "Asish")

# A vector which is a character vector


Language = c("R", "Python", "Java")

# A vector which is a numeric vector

Age = c(22, 25, 45)

# To create dataframe use data.frame command and


# then pass each of the vectors
# we have created as arguments
# to the function data.frame()

df = data.frame(Name, Language, Age)

print(df)

Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
Creating a data frame using data from a file: Data frames can also be created by importing the data from a
file. For this, you have to use the function read.table(), whose first argument is named file.
Syntax:
newDF = read.table(file = "Path of the file")
To create a dataframe from a CSV file in R:
Syntax:
newDF = read.csv("FileName.csv")
Accessing rows and columns
The syntax for accessing rows and columns is given below,
df[val1, val2]

df = dataframe object
val1 = rows of a data frame
val2 = columns of a data frame
So, val1 and val2 can each be a vector of values such as 1:2 or 2:3. If you specify only df[val2], it
refers to the set of columns that you need to access from the data frame.
Example: Row selection

# R program to illustrate operations
# on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
print(df)

# Accessing first and second row
cat("Accessing first and second row\n")
print(df[1:2, ])

Output:
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
Accessing first and second row
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
Example: Column selection
# R program to illustrate operations
# on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
print(df)

# Accessing first and second column
cat("Accessing first and second column\n")
print(df[, 1:2])

Output:
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
Accessing first and second column
   Name Language
1 Amiya        R
2   Raj   Python
3 Asish     Java
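The two examples above select rows and columns separately. As a small sketch, both indices can also be given together, and columns can be picked by name instead of position. Everything here is base R indexing.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# Rows 1 to 2 and columns 1 to 2 at the same time
print(df[1:2, 1:2])

# Columns selected by name instead of position
print(df[, c("Name", "Age")])

# A single index selects columns and returns a data frame
print(df[2])

# The $ operator returns the column as a plain vector
print(df$Age)
```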
Selecting the subset of the DataFrame
A subset of a DataFrame can also be created based on certain conditions with the following syntax:
newDF = subset(df, conditions)
df = Original data frame
conditions = Logical conditions that the selected rows must satisfy

Example:

# R program to illustrate operations
# on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
print(df)

# Selecting the subset of the data frame
# where Name is equal to Amiya
# OR Age is greater than 30
newDf = subset(df, Name == "Amiya" | Age > 30)

cat("After Selecting the subset of the data frame\n")
print(newDf)

Output:
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After Selecting the subset of the data frame
   Name Language Age
1 Amiya        R  22
3 Asish     Java  45
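The same selection can be written with a logical index inside square brackets, which is a common alternative to subset(). The sketch below shows both forms, plus the select argument of subset() for keeping only some columns. The variable names same and older are illustrative.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# subset() form, as in the example above
newDf = subset(df, Name == "Amiya" | Age > 30)
print(newDf)

# Equivalent logical indexing: columns must be written as df$...
same = df[df$Name == "Amiya" | df$Age > 30, ]
print(same)

# subset() can also restrict the columns with the select argument
older = subset(df, Age > 23, select = c(Name, Age))
print(older)
```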
Editing DataFrames
In R, DataFrames can be edited in two ways:
Editing data frames by direct assignment: Much like a list in R, a data frame can be edited by direct
assignment.
Example:

# R program to illustrate operation on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
cat("Before editing the data frame\n")
print(df)

# Editing data frames by direct assignment
# [[3]] accesses the top-level component,
# here the Age column
# [[3]][3] accesses the inner-level component,
# here the Age of Asish
df[[3]][3] = 30

cat("After editing the data frame\n")
print(df)

Output:
Before editing the data frame
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After editing the data frame
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  30
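Positional indices such as [[3]][3] are easy to get wrong when columns are reordered. As a sketch of an alternative, the same kind of edit can be addressed by column name and by a condition on the rows, using only base R assignment.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# Set the Age of the row whose Name is "Asish"
df$Age[df$Name == "Asish"] = 30

# Set a single cell by row number and column name
df[2, "Age"] = 26

print(df)
```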
Editing dataframes using the edit() command:
Follow the given steps to edit a DataFrame:
Step 1: Create an instance of a data frame. Here an empty data frame named “myTable” is created with the
command data.frame().
myTable = data.frame()

Step 2: Use the edit() function to launch the data editor, which requires an interactive R session. Note that
the result of edit() is assigned back to the “myTable” object, so the changes made in the editor are saved to
the original object.
myTable = edit(myTable)

When the above command is executed, a spreadsheet-like editor window pops up.
Step 3: Fill in the table with a small roster of names, languages, and ages.

Note that you can change variable names by clicking on their labels and typing your changes. Variables can also
be set as numeric or character. Once the data in the DataFrame is complete, close the table. Changes are
saved automatically.
Step 4: Check out the resulting data frame by printing it.
> myTable
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
Adding rows and columns to the data frame
Adding extra rows: We can add an extra row using the command rbind(). The syntax for this is given below:
newDF = rbind(df, the entries for the new row you have to add)
df = Original data frame

Be careful with the entries for the new row when using rbind(): the data type of each entry should match the
data type of the corresponding column in the existing rows.
Example:

# R program to illustrate operation on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
cat("Before adding row\n")
print(df)

# Add a new row using rbind()
newDf = rbind(df, data.frame(Name = "Sandeep",
                             Language = "C",
                             Age = 23))
cat("After adding a row\n")
print(newDf)

Output:
Before adding row
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After adding a row
     Name Language Age
1   Amiya        R  22
2     Raj   Python  25
3   Asish     Java  45
4 Sandeep        C  23
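Because the column types matter for rbind(), it can be worth comparing them before and after adding a row. The sketch below does that with sapply(df, class). It also relies on the fact that rbind() matches data frame columns by name, so the new row may list its columns in a different order. The name new_row is illustrative.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# New row with the columns in a different order; rbind() matches them by name
new_row = data.frame(Age = 23, Name = "Sandeep", Language = "C")
newDf = rbind(df, new_row)
print(newDf)

# Compare the column types before and after adding the row
print(sapply(df, class))
print(sapply(newDf, class))
```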
Adding extra columns: We can add an extra column using the command cbind(). The syntax for this is given
below:
newDF = cbind(df, the entries for the new column you have to add)
df = Original data frame

Example:

# R program to illustrate operation on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
cat("Before adding column\n")
print(df)

# Add a new column using cbind()
newDf = cbind(df, Rank = c(3, 5, 1))

cat("After adding a column\n")
print(newDf)

Output:
Before adding column
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After adding a column
   Name Language Age Rank
1 Amiya        R  22    3
2   Raj   Python  25    5
3 Asish     Java  45    1
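A column can also be added by assigning to a new name on the data frame, which modifies df itself instead of returning a new object as cbind() does. This is a minimal sketch of that alternative; the Senior column is a made-up example.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# Add a column with the $ operator
df$Rank = c(3, 5, 1)

# Add a column computed from an existing one, using [[ ]] with a name
df[["Senior"]] = df$Age > 30

print(df)
```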
Adding new variables to DataFrame
In R, we can add new variables to a data frame based on existing ones. To do that, first load the dplyr
package using the command library(). Then call the mutate() function, which adds extra variable columns
based on existing ones.
Syntax:
library(dplyr)
newDF = mutate(df, new_var = [existing_var])
df = original data frame
new_var = Name of the new variable
existing_var = An expression that modifies an existing variable (e.g. its log value, or the value multiplied by 10)

Example:

# R program to illustrate operation on a data frame

# Importing the dplyr library
library(dplyr)

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

cat("Original Dataframe\n")
print(df)

# Creating an extra variable column
# "log_Age" which is log of variable column "Age"
# Using mutate() command
newDf = mutate(df, log_Age = log(Age))

cat("After creating extra variable column\n")
print(newDf)

Output:
Original Dataframe
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After creating extra variable column
   Name Language Age  log_Age
1 Amiya        R  22 3.091042
2   Raj   Python  25 3.218876
3 Asish     Java  45 3.806662
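If dplyr is not available, base R's transform() covers the same basic case. The sketch below is an alternative to mutate(), not the method used above. It derives the log column and also the multiply-by-10 variant mentioned in the syntax notes. The column name Age_x10 is made up for illustration.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# transform() evaluates each expression using the columns of df
newDf = transform(df, log_Age = log(Age), Age_x10 = Age * 10)

print(newDf)
```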
Deleting rows and columns from a data frame
To delete a row or a column, index that row or column and put a negative sign before its number. The
negative sign indicates that the row or column should be left out of the result.
Syntax:
newDF = df[-rowNo, -colNo]
df = original data frame

Example:

# R program to illustrate operation on a data frame

# Creating a dataframe
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)
cat("Before deleting the 3rd row and 2nd column\n")
print(df)

# Delete the third row and the second column
newDF = df[-3, -2]

cat("After deleting the 3rd row and 2nd column\n")
print(newDF)

Output:
Before deleting the 3rd row and 2nd column
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
After deleting the 3rd row and 2nd column
   Name Age
1 Amiya  22
2   Raj  25
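Negative indices work by position. As a sketch of some related base R idioms, columns can also be dropped by name and rows by condition. One caveat worth knowing: when only one column remains, the result is simplified to a vector unless drop = FALSE is given. The variable names below are illustrative.

```r
df = data.frame(
  "Name" = c("Amiya", "Raj", "Asish"),
  "Language" = c("R", "Python", "Java"),
  "Age" = c(22, 25, 45)
)

# Delete a column by name instead of by position
no_lang = df[, names(df) != "Language"]
print(no_lang)

# Delete rows by condition: keep only rows with Age of 30 or less
young = df[df$Age <= 30, ]
print(young)

# Deleting all but one column returns a vector by default
as_vector = df[, -c(2, 3)]
print(is.data.frame(as_vector))

# drop = FALSE keeps the result as a data frame
as_frame = df[, -c(2, 3), drop = FALSE]
print(is.data.frame(as_frame))
```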
