R 1st Unit
In simple words, the vector in R is a very basic data structure in R Programming. Using vectors
we can create other data structures like Data Frame and Matrix.
1. What is R Vector?
The vector is the simplest basic type of object in R. A vector is a sequence of data elements
of the same type. A vector is similar to a Series object in Python pandas. Members of a vector
are called components.
Let’s create a simple numeric vector using c(). The name c() stands for combine (or concatenate),
and it is one of the most used ways to create a vector in R.
# Numeric vector (R stores these values as doubles; write 1L, 3L, ... for true integers)
v_int = c(1,3,6,7)
print(v_int)
# Output
#[1] 1 3 6 7
You can use c() in the same way to create a character vector. Alternatively, the character()
function creates a character vector of a given length filled with empty strings.
# Character vector
v_char = c('One','Two','Three')
print(v_char)
# Output
#[1] "One" "Two" "Three"
Use the length() function to get the length of a vector. Pass the vector as the argument to this
function.
# Length of the character vector
length(v_char)
# Output
#[1] 3
You can get the type of a vector using the typeof() function. It takes the vector object as an
argument and returns its type.
If you have a character vector, you can convert it to double by
using the as.vector() function with the mode value "numeric".
# Syntax of as.vector()
as.vector(x, mode = "any")
Example
# Output
#[1] "character"
#[1] "double"
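The code for this example did not survive in these notes. A minimal sketch that produces the two type names shown above could look like this; the vector name v_txt and its values are made up for illustration.

```r
# Hypothetical character vector holding numbers as text
v_txt <- c("10", "20", "30")
typeof(v_txt)                                 # "character"

# Convert to numeric (double) with as.vector()
v_num <- as.vector(v_txt, mode = "numeric")
typeof(v_num)                                 # "double"
length(v_num)                                 # 3
```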
# Syntax of is.vector()
is.vector(x, mode = "any")
Example
# Output
[1] TRUE
[1] FALSE
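The example code is missing here. One plausible way to get a TRUE followed by a FALSE is sketched below, assuming a numeric vector v (the name and values are illustrative): is.vector() checks whether the object is a vector, optionally of a given mode.

```r
v <- c(1, 3, 6, 7)

is_vec  <- is.vector(v)                       # TRUE: v is a vector
is_char <- is.vector(v, mode = "character")   # FALSE: v is numeric, not character

print(is_vec)
print(is_char)
```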
3. Empty Vector
If the length of a vector is zero, it is considered an empty vector in R. It can be created in
several ways. One way to create an empty vector is to call the c() function
without any arguments.
# Output
NULL
Now let’s get the length of the vector and check if it is empt y.
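As a small sketch of that check (the variable names are illustrative), length() returns 0 for an empty vector. Note that c() with no arguments returns NULL, while vector() can create an empty vector of a specific type.

```r
# Empty vector from c(): this is NULL
v_empty <- c()
length(v_empty)          # 0
is.null(v_empty)         # TRUE

# Typed empty vector created with vector()
v_empty_chr <- vector("character", length = 0)
length(v_empty_chr) == 0 # TRUE
```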
By using the data.frame() function we can create a data frame from vectors. A data frame is a list of
variables with the same number of rows and unique row names, so all the vectors you combine
should have the same length; vectors whose lengths differ (and cannot be recycled evenly) result in an error.
# Create Vectors
id <- c(10,11,12,13)
name <- c('sai','ram','deepika','sahithi')
dob <- as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))
# Create DataFrame
df <- data.frame(id,name,dob)
In the above example, I have used the vectors id, name and dob as
arguments to the data.frame() function, separated by commas. R creates a data frame whose
columns have the same names as the vectors. You can use df or print(df) to print the data frame
to the console, which yields the output below.
# Output
  id    name        dob
1 10     sai 1990-10-02
2 11     ram 1981-03-24
3 12 deepika 1987-06-14
4 13 sahithi 1985-08-16
6. Repeating Vectors
By using the rep() function we can repeat the members of a vector a specified number
of times and create a new vector with the result.
# Repeat vectors
v_rep = rep(c(1, 5), times = 3)
print(v_rep)
# Output
#[1] 1 5 1 5 1 5
Let’s pass a vector to times and see what happens. Each element of times specifies how often the
corresponding value is repeated. You can also use the each argument to specify how many times to
repeat each value.
# Repeat vectors
v_rep = rep(c(1, 5), times = c(2,3))
print(v_rep)
# Output
#[1] 1 1 5 5 5
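The each argument mentioned above is not demonstrated in these notes; a short sketch (variable names are illustrative) is:

```r
# Repeat every element twice before moving to the next one
v_each <- rep(c(1, 5), each = 2)
print(v_each)        # 1 1 5 5

# each is applied first, then the whole result is repeated 'times' times
v_both <- rep(c(1, 5), each = 2, times = 2)
print(v_both)        # 1 1 5 5 1 1 5 5
```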
7. Types of Vector in R
As explained above, c() combines its arguments, so let’s use it to combine two
or more vectors into a single vector. By using c() or append() you can also add an element to a vector
in R.
# Combine vectors
v_combine = c(v_char,v_char2)
print(v_combine)
# Output
#[1] "One" "Two" "Three" "Four" "Five" "Six"
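The snippet above relies on v_char2, whose definition is not shown. The self-contained sketch below assumes v_char2 holds 'Four', 'Five' and 'Six' (inferred from the printed output) and also shows append(), including its after argument.

```r
v_char  <- c('One', 'Two', 'Three')
v_char2 <- c('Four', 'Five', 'Six')   # assumed definition, inferred from the output

# Combine two vectors into one
v_combine <- c(v_char, v_char2)

# Add a single element at the end
v_more <- append(v_combine, 'Seven')

# Insert an element at the front (after position 0)
v_front <- append(v_char, 'Zero', after = 0)
print(v_front)       # "Zero" "One" "Two" "Three"
```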
9. Access Vector
By using R bracket notation you can access the members by index. Note that, unlike many other
programming languages, the R index starts from 1. To get a range of values
use startindex:endindex; this returns all values from the start index to the end index, inclusive. Let’s access
elements by giving an index inside single square brackets [].
When you use a negative index, R removes the element at that position and returns the rest.
For example, the following creates a vector slice with the second member removed.
# Output
#[1] "Four" "Six"
# Output
#[1] NA
By using an index you can also modify the value at a position. The following
example changes the value of the first member of the vector to the character string Oneeeeeeee.
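Most of the access examples are missing from these notes, so here is a compact sketch of the operations described above. The vector v is an illustrative stand-in, not necessarily the one used in the original examples.

```r
v <- c('One', 'Two', 'Three', 'Four', 'Five', 'Six')

first   <- v[1]      # single element: "One" (indexing starts at 1)
middle  <- v[2:4]    # range: "Two" "Three" "Four"
dropped <- v[-2]     # negative index removes the second member
missing <- v[10]     # index beyond the length returns NA

# Modify the first member in place
v[1] <- 'Oneeeeeeee'
print(v)
```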
A vector can hold only elements of the same type, whereas a list can hold elements of
different types such as numeric, character, logical, etc.
A vector is a flat, one-dimensional sequence, whereas a list can contain other vectors and lists as elements.
A vector is not recursive, whereas a list is recursive.
# Create list
list_obj = list('One',2,TRUE)
print(list_obj)
# Output
#[[1]]
#[1] "One"
#[[2]]
#[1] 2
#[[3]]
#[1] TRUE
NA represents a ‘Missing Value’ or ‘Not Available’ in R. To remove NA values from a
vector use is.na() or na.omit(), or pass the na.rm = TRUE argument to functions that support it.
R also supports several other methods to delete NA from a vector.
When you run an arithmetic operation on an R vector that has NA values, you will get NA as the
result. In general, any operation on an NA becomes NA. Hence, it’s best practice to clean up the
vector by removing all NA values before performing an arithmetic operation.
# Quick Examples
# Create Vector
ids <- c(10,5,NA,15,10)
# Using is.na()
x <- ids[!is.na(ids)]
# Using na.rm=TRUE
x <- max(ids, na.rm=TRUE)
# Using na.omit()
x <- na.omit(ids)
# Using na.exclude()
x <- na.exclude(ids)
print(x)
13. Sorting Vectors
To sort a vector, use the sort() function. It takes a vector as an argument and
returns the values in alphabetical order for a character vector and in ascending order for a numeric vector.
# Output
#> sort(v)
#[1] "A" "D" "G" "R" "X"
#>sort(v, decreasing=TRUE)
#[1] "X" "R" "G" "D" "A"
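The definition of v is not shown above. A sketch that reproduces the same sorted output, assuming v holds the five letters in some unsorted order, is:

```r
v <- c('D', 'R', 'G', 'A', 'X')       # assumed input order

asc  <- sort(v)                       # "A" "D" "G" "R" "X"
desc <- sort(v, decreasing = TRUE)    # "X" "R" "G" "D" "A"

# Numeric vectors are sorted in ascending order by default
nums <- sort(c(7, 1, 3))              # 1 3 7
```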
To subset elements from a vector use bracket notation []. With this notation we can subset
a single element, multiple elements, or a range of elements, and it also works for selecting
elements from a list, etc.
# Create a vector
v <- c('A','B','C','D','E','F')
# Output
#> v[1]
#[1] "A"
# Output
#> v[c(1,3)]
#[1] "A" "C"
#> v[c(2.3,4.5)]
#[1] "B" "D"
15. Removing Duplicates
Handling duplicate values is one of the challenging tasks when analyzing data. Removing
duplicates is part of data cleaning, which needs to be done before performing any operations
on the data, because duplicate values lead to inconsistent results. For an R vector, we can
remove duplicates by using the duplicated() function, which flags every repeated occurrence
of a value.
# Remove duplicates
v <- c('A','B','D','C','A','F','G','C','d','E','E')
v[!duplicated(v)]
# Output
#[1] "A" "B" "D" "C" "F" "G" "d" "E"
16. Remove Elements
To remove multiple values from a vector, define a vector or list with the values you want to
remove and use it with bracket notation [] and the %in% operator. The %in% operator
checks which elements match the values to remove.
# Output
#[1] "B" "C" "D" "E" "F" "B"
# Output
#[1] "B" "C" "F" "B"
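The code that produced these two outputs is missing. The following sketch is a reconstruction under the assumption that the input vector is the one defined below; it gives the same two results.

```r
v <- c('A', 'B', 'C', 'D', 'E', 'F', 'B')   # assumed input, inferred from the outputs

# Remove a single value
v_one <- v[!(v %in% c('A'))]
print(v_one)      # "B" "C" "D" "E" "F" "B"

# Remove several values at once
v_many <- v[!(v %in% c('A', 'D', 'E'))]
print(v_many)     # "B" "C" "F" "B"
```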
Vector recycling in R
Vector recycling happens when we perform an operation such as addition or subtraction on two
vectors of unequal length. The shorter vector is repeated until the operation has been completed on
the longer vector. If we perform an addition on vectors of equal length, the first value of vector 1 is
added to the first value of vector 2, the second to the second, and so on. The image below demonstrates the
operation on unequal vectors and on equal vectors.
So, repeating the shorter vector until the operation on the longer vector is complete is known
as vector recycling. It is a special property of vectors in the R language. Let us see the
implementation of vector recycling.
Example 1 :
Output :
In vector recycling, the length of the longer vector should be a multiple of the length of the shorter
vector. If not, we get a warning that the longer object length is not a multiple of the shorter object length. Here the
longer object length is a multiple of the shorter object length, so we didn’t get a warning message.
Example 2 :
Output :
Here also the longer object length is multiple of the shortest object length. So, we didn’t get warning message.
Example 3 :
Output :
Here the longer object length is not multiple of the shortest object length. So, we got a warning message.
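The code and output of the three examples are not reproduced in these notes. The sketch below uses made-up vectors to show the same three situations: two cases where the lengths divide evenly and one where they do not.

```r
# Lengths 4 and 2: the shorter vector is used twice, no warning
r1 <- c(1, 2, 3, 4) + c(10, 20)           # 11 22 13 24

# Lengths 6 and 3: the shorter vector is used twice, no warning
r2 <- c(1, 2, 3, 4, 5, 6) * c(1, 2, 3)    # 1 4 9 4 10 18

# Lengths 5 and 2: recycling still happens, but R raises a warning
r3  <- suppressWarnings(c(1, 2, 3, 4, 5) + c(10, 20))   # 11 22 13 24 15
msg <- tryCatch(c(1, 2, 3, 4, 5) + c(10, 20),
                warning = function(w) conditionMessage(w))
print(msg)
```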
# R program to illustrate
# dim function
Output:
[1] 1
Output:
[1] 1 2 3 4 5 6 7 8 9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
R Matrix
In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created by passing a
vector to the matrix() function. On R matrices, we can perform addition, subtraction, multiplication, and
division operations.
In an R matrix, elements are arranged in a fixed number of rows and columns. The elements are usually
numbers, although a matrix can also hold character or logical values. In every case,
all the elements of a matrix must share a common basic type.
Example
matrix1 <- matrix(c(11, 13, 15, 12, 14, 16), nrow = 2, ncol = 3, byrow = TRUE)
matrix1
Output
Like vector and list, R provides a function to create a matrix: the matrix() function.
This function plays an important role in data analysis. The syntax of matrix() in R is:
matrix(data, nrow, ncol, byrow, dimnames)
data
The first argument in matrix function is data. It is the input vector which is the data elements of the matrix.
nrow
The second argument is the number of rows which we want to create in the matrix.
ncol
The third argument is the number of columns which we want to create in the matrix.
byrow
The byrow parameter is a logical value. If it is TRUE, the input vector elements are arranged by row;
otherwise (the default) they are arranged by column.
dimnames
The dimnames parameter is a list holding the names assigned to the rows and columns.
Let's see an example to understand how matrix function is used to create a matrix and arrange the elements
sequentially by row or column.
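The example itself is missing here. A minimal sketch, with illustrative variable names, of filling the same data by row and by column:

```r
# Fill row by row
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(m_row)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# Fill column by column (the default, byrow = FALSE)
m_col <- matrix(1:6, nrow = 2, ncol = 3)
print(m_col)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

dim(m_col)        # 2 3
is.matrix(m_col)  # TRUE
```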
Matrix is similar to vector but additionally contains the dimension attribute. All attributes of an object can be
checked with the attributes() function (dimension can be checked directly with the dim() function).
We can check if a variable is a matrix or not with the class() function.
Output
matrix(1:9, nrow = 3)
Output
We can see that the matrix is filled column-wise. This can be reversed to row-wise filling by passing TRUE to
the argument byrow.
Output
In all cases, however, a matrix is stored in column-major order internally as we will see in the subsequent
sections.
It is possible to name the rows and columns of matrix during creation by passing a 2 element list to the
argument dimnames:
Output
  A B C
X 1 4 7
Y 2 5 8
Z 3 6 9
These names can be accessed or changed with two helpful functions colnames() and rownames().
Output
cbind(c(1,2,3),c(4,5,6))
rbind(c(1,2,3),c(4,5,6))
Output
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Here, the cbind() function is used to combine two vectors, c(1, 2, 3) and c(4, 5, 6), by column.
The resulting matrix will have two columns, where the first column contains the elements of the first vector and
the second column contains the elements of the second vector.
Similarly, the rbind() function is used to combine two vectors, c(1, 2, 3) and c(4, 5, 6), by row.
The resulting matrix will have two rows, where the first row contains the elements of the first vector and the
second row contains the elements of the second vector.
x <- c(1,2,3,4,5,6)
class(x)
Output
[1] "numeric"
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[1] "matrix" "array"
# leaving row as well as column field blank will select entire matrix
x[,]
Output
[,1] [,2]
[1,] 2 3
[2,] 5 6
[,1] [,2] [,3]
[1,] 7 8 9
[2,] 4 5 6
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1] [,2] [,3]
[1,] 4 5 6
[2,] 7 8 9
One thing to notice here is that, if the matrix returned after indexing is a row matrix or column matrix, the result
is given as a vector. In the above example if we do:
x[1,]
class(x[1,])
We get,
Output
[1] 1 2 3
[1] "numeric"
This behavior can be avoided by using the argument drop = FALSE while indexing. If we do:
class(x[1,,drop=FALSE])
Then we get,
Output
While indexing in such a way, it acts like a vector formed by stacking columns of the matrix one after another.
The result is returned as a vector.
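The indexing calls behind the outputs in this section are missing. The sketch below assumes the 3 x 3 matrix printed above (1 to 9, filled by row) and shows row/column indexing, the drop argument, and indexing with a single vector.

```r
x <- matrix(1:9, nrow = 3, byrow = TRUE)   # assumed matrix, matches the printed 3 x 3 output

sub1 <- x[c(1, 2), c(2, 3)]   # rows 1-2, columns 2-3
sub2 <- x[c(3, 2), ]          # rows 3 and 2, all columns
sub3 <- x[-1, ]               # everything except the first row

row1     <- x[1, ]                 # a single row is simplified to a vector
row1_mat <- x[1, , drop = FALSE]   # drop = FALSE keeps it as a 1 x 3 matrix

# A single index vector walks the matrix column by column
first4 <- x[1:4]                   # 1 4 7 2
```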
Output
[,1] [,2]
[1,] 4 6
[2,] 3 7
[,1] [,2]
[1,] 6 1
[2,] 7 9
It is also possible to index using a single logical vector where recycling takes place if necessary.
x[c(TRUE, FALSE)]
In the above example, the matrix x is treated as a vector formed by stacking columns of the matrix one after
another, i.e., (4,6,1,8,0,2,3,7,9).
The indexing logical vector is also recycled and thus alternating elements are selected. This property is utilized
for filtering of matrix elements as shown below.
Output
[1] 6 8 7 9
[1] 4 6 8 0 2
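The filtering expressions themselves are not shown. Assuming the matrix is filled column-wise from the values listed above, the following sketch reproduces both outputs; the conditions x > 5 and x %% 2 == 0 are a reconstruction.

```r
x <- matrix(c(4, 6, 1, 8, 0, 2, 3, 7, 9), nrow = 3)   # stacks to 4,6,1,8,0,2,3,7,9

# The logical index is recycled, so every other element is picked
alt <- x[c(TRUE, FALSE)]      # 4 1 0 3 9

# Filter by a condition: the comparison yields a logical matrix
big  <- x[x > 5]              # 6 8 7 9
even <- x[x %% 2 == 0]        # 4 6 8 0 2
```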
# subset the matrix by selecting rows that are TRUE and columns "A" and "C"
x[TRUE, c("A", "C")]
# subset the matrix by selecting rows 2 to 3 and columns "A" and "C"
x[2:3, c("A", "C")]
Output
[1] 4 8 3
     A C
[1,] 4 1
[2,] 8 2
[3,] 3 9
     A C
[1,] 8 2
[2,] 3 9
How to modify a matrix in R?
We can combine the assignment operator with the above learned methods for accessing elements of a matrix to
modify it. For example,
Output
Transpose a Matrix
A common operation with matrices is to transpose it. This can be done with the function t().
t(x)
Output
We can add rows or columns using the rbind() and cbind() function respectively. Similarly, it can be removed
through reassignment.
# add a row
rbind(x,c(1,2,3))
Output
Dimension of the matrix can be modified as well, using the dim() function.
Output
[,1] [,2]
[1,] 1 5
[2,] 4 3
[3,] 2 6
In R, we can perform the mathematical operations on a matrix such as addition, subtraction, multiplication, etc.
For performing the mathematical operation on the matrix, it is required that both the matrix should have the
same dimensions.
Let see an example to understand how mathematical operations are performed on the matrix.
Example 1
#Addition
sum<-R+S
print(sum)
#Subtraction
sub<-R-S
print(sub)
#Multiplication
mul<-R*S
print(mul)
#Multiplication by constant
mul1<-R*12
print(mul1)
#Division
div<-R/S
print(div)
Output
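The matrices R and S are not defined in the snippet above, and its output is missing. The sketch below uses two small made-up 2 x 2 matrices to show that all of these operators work element by element. The result is named total rather than sum to avoid shadowing the built-in sum() function.

```r
R <- matrix(c(2, 4, 6, 8), nrow = 2)   # illustrative values
S <- matrix(c(1, 2, 3, 4), nrow = 2)   # illustrative values

total <- R + S     # element-wise addition
sub   <- R - S     # element-wise subtraction
mul   <- R * S     # element-wise multiplication (not the matrix product)
mul1  <- R * 12    # every element multiplied by the constant
div   <- R / S     # element-wise division

print(total)
#      [,1] [,2]
# [1,]    3    9
# [2,]    6   12
```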
Introduction to Factors in R
Factors play a crucial role in data analysis. Learn how to create, subset, and compare them.
What is a Factor?
A factor refers to a statistical data type used to store categorical variables. Categorical variables belong to
a limited number of categories. Continuous variables, on the other hand, can correspond to an infinite number
of values.
It is important to distinguish whether you are working with a categorical variable or a continuous variable, as
statistical models treat both types differently.
An example of a categorical variable is sex. In most cases, you can limit the categories to “Male” or “Female”.
While there may be circumstances where you may include additional categories (whether to take into
consideration chromosomal variation, hermaphroditic animals, or different cultural norms), you will always have
a finite number of categories.
To create factors in R, use the factor() function. The factor function is used to encode a vector as a factor (other
terms for factors are ‘category’ and ‘enumerated type’). For example, sex_vector contains the sex of 5 different
individuals:
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
It is clear that there are two categories (factor levels): “Male” and “Female”.
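A short sketch of the conversion follows; the name factor_sex_vector is chosen here for illustration. The levels are sorted alphabetically, and each element is stored internally as an integer code pointing at a level.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Encode the character vector as a factor
factor_sex_vector <- factor(sex_vector)
print(factor_sex_vector)

levels(factor_sex_vector)       # "Female" "Male"
nlevels(factor_sex_vector)      # 2
as.integer(factor_sex_vector)   # 2 1 1 2 2
```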
R constructs and prints nominal and ordinal variables differently. For nominal variables, there is no ordering, so
R prints them without indicating any ordering.
# Animals
[1] Elephant Giraffe Donkey Horse
Levels: Donkey Elephant Giraffe Horse
Factor Levels
Sometimes, it can be helpful to change the names of specific factor levels in a data set for clarity or other
reasons. You can do this using the levels() function:
levels(factor_vector) <- c("name1", "name2", ...)
For example, suppose the raw data from a survey contains a question regarding the sex of the respondent and
just two categories were recorded: "M" and "F" .
survey_vector <- c("M", "F", "F", "M", "M")
Recording the sex with abbreviations can be convenient while collecting data (especially with pen and paper),
but it can become confusing once it is time to analyze the data. At that point, you may want to change the factor
levels to "Male" and "Female" instead of "M" and "F" for clarity.
Caution: the order in which you assign the levels is important. levels(factor_survey_vector) outputs [1] "F"
"M" . If you don’t specify the levels of the factor when you are creating it, R will automatically assign
them alphabetically. To correctly map "F" to "Female" and "M" to "Male" , the levels should be set
to c("Female", "Male") , in this order.
Instead of the “F” and “M” default levels from the data, we’ve renamed them to “Female” and “Male”.
Summarizing a Factor
Suppose we want to determine how many responses of each factor level we have in our vector. We can use
the summary() function to determine this.
summary(factor_survey_vector) is clearly more useful in this case in comparison to summary(survey_vector)
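The renaming and summary steps described above can be sketched as follows, using the survey_vector from the text.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

levels(factor_survey_vector)            # "F" "M" (alphabetical order)

# Rename the levels: the order must follow the existing level order
levels(factor_survey_vector) <- c("Female", "Male")

# summary() of a character vector only reports length, class and mode
summary(survey_vector)

# summary() of a factor counts the responses per level
counts <- summary(factor_survey_vector)
print(counts)
# Female   Male
#      2      3
```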
Factor Comparison
What happens when you try to compare elements of a factor? In factor_survey_vector you have a factor with two
levels: "Male" and "Female" . But how are these valued relative to each other?
Ordered Factors
For unordered (or nominal) factor levels, such as "Male" and "Female" , comparison operators, such as the
greater than operator, are not meaningful and R returns a warning message to reflect that.
R attaches an equal value to the levels for factors by default. However, this does not necessarily have to be the
case. Sometimes you will also deal with factors that have a natural ordering between its categories. Such factors
are called ordinal factors. In this circumstance, we must pass this information to R.
Suppose you want to evaluate the performance of five analysts. To do this, you track their speed as
either "slow" , "medium" , or "fast" . You get the following results:
Analyst 1 is medium,
Analyst 2 is slow,
Analyst 3 is slow,
Analyst 4 is medium,
Analyst 5 is fast.
Our first step is to construct a vector with five entries and assign the speed factor to each analyst.
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
The speed_vector should be converted to an ordinal factor since its categories have a natural ordering. That
is, "fast" is greater than "medium" , which is then greater than "slow" .
By default, the factor() function transforms the speed_vector into an unordered factor. To create an ordered
factor, two additional arguments are needed: ordered and levels .
factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", ...))
By setting the argument ordered to TRUE , you indicate that the factor is ordered. With the argument levels you
give the values of the factor in the correct order.
Solution
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)
Comparing Ordered Factors
You can compare different elements of an ordered factor by using the well-known operators. For example, to
check whether the first element of the first factor vector is greater than the first element of the second factor vector,
you would use the greater than operator ( > ).
factor_vector1[1] > factor_vector2[1]
Solution
# Create factor_speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]
# Is data analyst 2 faster than data analyst 5?
da2 > da5
The result is FALSE: data analyst two ("slow") is not faster than data analyst five ("fast"). In other words,
if data analyst two is complaining about data analyst five, they are in fact the one slowing everything down.
R Lists
If a vector has elements of different types, it is called a list in R programming.
A list is a flexible data structure that can hold elements of different types, such as numbers, characters, vectors,
matrices, and even other lists.
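The call that creates the list is missing from these notes. Based on the printed components and the description that follows, it was most likely along these lines:

```r
# A list with three named components of different types
x <- list(a = 2.5, b = TRUE, c = 1:3)
x
```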
Output
$a
[1] 2.5
$b
[1] TRUE
$c
[1] 1 2 3
Here, we create a list x, of three components with data types double, logical and integer vector respectively.
The structure of the above list can be examined with the str() function
str(x)
Output
List of 3
$ a:num 2.5
$ b:logi TRUE
$ c:int [1:3] 1 2 3
In this example, a, b and c are called tags which makes it easier to reference the components of the list.
However, tags are optional. We can create the same list without the tags as follows. In such a scenario, numeric
indices are used by default.
x <- list(2.5,TRUE,1:3)
x
Output
[[1]]
[1] 2.5
[[2]]
[1] TRUE
[[3]]
[1] 1 2 3
Output
[1] "John"
[1] 19
[1] "English" "French"
$name
[1] "John"
$age
[1] 19
$name
[1] "John"
$speaks
[1] "English" "French"
$name
[1] "John"
$age
[1] 19
$speaks
[1] "English" "French"
Indexing with [ as shown above will give us a sublist not the content inside the component. To retrieve the
content, we need to use [[.
However, this approach will allow us to access only a single component at a time.
# check the type of the result (double bracket returns the content)
typeof(x[["age"]])
Output
$age
[1] 19
[1] "list"
[1] 19
[1] "double"
An alternative to [[, which is used often while accessing content of a list is the $ operator. They are both the
same, except that $ can do partial matching on tags.
Output
[1] "John"
[1] 19
[1] "English" "French"
[1] "Alice"
[1] 25
[1] "Spanish" "Italian"
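The access examples are only partly preserved. The sketch below assumes the list x with the components name, age and speaks shown in the outputs, and contrasts the three operators. The partial-matching call x$na is an illustrative addition.

```r
x <- list(name = "John", age = 19, speaks = c("English", "French"))

sub_list <- x["age"]        # single bracket: a sublist
typeof(sub_list)            # "list"

age_val <- x[["age"]]       # double bracket: the content itself
typeof(age_val)             # "double"

two <- x[c("name", "speaks")]   # [ can select several components at once

x$name                      # "John"
partial <- x$na             # $ partially matches the tag "name"
exact   <- x[["na"]]        # [[ matches exactly by default, so this is NULL
```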
Output
$name
[1] "Clair"
$age
[1] 19
$speaks
[1]"English" "French"
Output
$name
[1] "Clair"
$age
[1] 19
$speaks
[1] "English" "French"
$married
[1] FALSE
How to delete components from a list?
We can delete a component by assigning NULL to it.
Output
List of 2
$ name :chr "Clair"
$ speaks:chr [1:2] "English" "French"
List of 2
$ name :chr "Clair"
$ speaks:chr [1:2] "English" "French"
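The code for modifying, adding and deleting components is not preserved above. A sketch consistent with the printed results, assuming the same list x, is:

```r
x <- list(name = "John", age = 19, speaks = c("English", "French"))

# Modify an existing component
x[["name"]] <- "Clair"

# Add a new component by assigning to a new tag
x[["married"]] <- FALSE

# Delete components by assigning NULL
x[["age"]] <- NULL
x$married <- NULL

str(x)   # a list of 2: name and speaks
```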
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one
variable and each row contains one set of values from each column.
The structure of the data frame can be seen by using str() function.
The statistical summary and nature of the data can be obtained by applying summary() function.
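# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),   # assumed first column; its name and values are not visible in the output below
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)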
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract 3rd and 5th row with 2nd and 4th column
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
Add Column
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same
structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to create
the final data frame.
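The examples for adding a column and adding rows are missing. The sketch below uses a small made-up data frame (all names and values are illustrative) to show both steps: a new column is added with the $ operator, and new rows with the same structure are appended with rbind().

```r
emp <- data.frame(
  emp_id   = 1:2,
  emp_name = c("Rick", "Dan"),
  stringsAsFactors = FALSE
)

# Add a column: the new vector needs one value per row
emp$dept <- c("IT", "Operations")

# New rows must have the same columns as the existing data frame
emp_new <- data.frame(
  emp_id   = 3:4,
  emp_name = c("Rasmi", "Pranab"),
  dept     = c("IT", "HR"),
  stringsAsFactors = FALSE
)

# Bind the two data frames row-wise
emp_final <- rbind(emp, emp_new)
print(emp_final)
```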
R packages are collections of R functions, compiled code and sample data. They are stored under a directory
called "library" in the R environment. By default, R installs a set of packages during installation. More
packages are added later, when they are needed for some specific purpose. When we start the R console, only
the default packages are available. Other packages which are already installed have to be loaded
explicitly to be used by the R program that is going to use them.
Below is a list of commands to be used to check, verify and use the R packages.
Get the library locations containing R packages:
.libPaths()
The result of the above code varies depending on the local settings of your PC.
Get the list of all the packages installed:
library()
The result of the above code varies depending on the local settings of your PC.
Get all packages currently loaded in the R environment:
search()
The result of the above code varies depending on the local settings of your PC.
There are two ways to add new R packages. One is installing directly from the CRAN directory and another is
downloading the package to your local system and installing it manually.
The following command gets the packages directly from CRAN webpage and installs the package in the R
environment. You may be prompted to choose a nearest mirror. Choose the one appropriate to your location.
install.packages("Package Name")
Go to the link R Packages to download the package needed. Save the package as a .zip file in a suitable location
in the local system.
Now you can run the following command to install this package in the R environment.
Before a package can be used in the code, it must be loaded to the current R environment. You also need to load
a package that is already installed previously but not available in the current environment.
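The loading command is not shown in these notes. A minimal sketch using library() follows. The tools package is used because it ships with R, while "somePackage" is a placeholder name and not a real package.

```r
# 'tools' is part of the standard R distribution, so it can be loaded directly
library(tools)
loaded <- "package:tools" %in% search()   # TRUE once the package is attached

# For an add-on package, check that it is installed before loading it
if (requireNamespace("somePackage", quietly = TRUE)) {
  library("somePackage")
} else {
  message("somePackage is not installed; install it with install.packages() first")
}
```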
What is Tidyverse?
Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella
help us in performing and interacting with the data. There are a whole host of things you can do with your data,
such as subsetting, transforming, visualizing, etc.
Tidyverse packages
The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of tidyverse
1.3.0, the following packages are included in the core tidyverse:
ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the
data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the
details.
dplyr
dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common
data manipulation challenges.
tidyr
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief,
every variable goes in a column, and every column is a variable.
readr
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly
parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for
working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for
loops with code that is easier to write and more expressive.
tibble
tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing
out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you
to confront problems earlier, typically leading to cleaner, more expressive code.
stringr
stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built
on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string
manipulations.
forcats
forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle
categorical variables, variables that have a fixed and known set of possible values.
The tidyverse also includes many other packages with more specialised usage. They are not loaded
automatically with library(tidyverse), so you’ll need to load each one with its own call to library().
Output:
ID Name Post Age
1 5 H CA 67
2 6 K SDE 39
3 7 Z Admin 28
Method 2: Using read.table() methods.
Here we will use the read.table() method to import a CSV file into the R Programming Language.
# print x
print(x)
Output:
Col1.Col2.Col3
1 100, a1, b1
2 200, a2, b2
3 300, a3, b3
Importing Data from a Text File
We can easily import or read .txt file using basic R function read.table(). read.table() is used to read a file in
table format. This function is easy to use and flexible.
Syntax:
# read data stored in .txt file
x <- read.table("file_name.txt", header = TRUE)   # use header = FALSE if the file has no header row
# print x
print(x)
Output:
V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
If the header argument is set to TRUE, the first line of the file is read as the column names.
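To see the effect of header without needing an existing file, the sketch below first writes a small temporary text file (its contents are made up) and then reads it both ways.

```r
# Write a small whitespace-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("Col1 Col2 Col3",
             "100 a1 b1",
             "200 a2 b2",
             "300 a3 b3"), tmp)

# header = TRUE: the first line supplies the column names
x <- read.table(tmp, header = TRUE)
names(x)      # "Col1" "Col2" "Col3"

# header = FALSE: columns are named V1, V2, ... and the first line becomes data
y <- read.table(tmp, header = FALSE)
names(y)      # "V1" "V2" "V3"
nrow(y)       # 4

unlink(tmp)   # remove the temporary file
```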
Importing Data from a delimited file
R has a function read.delim() to read delimited files into a data frame (which R stores internally as a list of
columns). By default the fields are assumed to be separated by a tab, sep = "\t"; the separator can instead be a
comma (,), dollar symbol ($), etc.
Syntax: read.delim("file_name.txt", sep = "\t", header = TRUE)
# print x
print(x)
# print type of x
typeof(x)
Output:
X.V1.V2.V3
1 1, 100, a1, b1
2 2, 200, a2, b2
3 3, 300, a3, b3
[1] "list"
Importing Json file in R
Here we are going to use rjson package to import the JSON file into R Programming Language.
Output:
$ID
[1] "1" "2" "3" "4" "5"
$Name
[1] "Mithuna" "Tanushree" "Parnasha" "Arjun" "Pankaj"
$Salary
[1] "722.5" "815.2" "1611" "2829" "843.25"
Importing XML file in R
To import XML file here we are going to use XML Package in R Programming language.
XML file for demonstration:
<RECORDS>
<STUDENT>
<ID>1</ID>
<NAME>Alia</NAME>
<MARKS>620</MARKS>
<BRANCH>IT</BRANCH>
</STUDENT>
<STUDENT>
<ID>2</ID>
<NAME>Brijesh</NAME>
<MARKS>440</MARKS>
<BRANCH>Commerce</BRANCH>
</STUDENT>
<STUDENT>
<ID>3</ID>
<NAME>Yash</NAME>
<MARKS>600</MARKS>
<BRANCH>Humanities</BRANCH>
</STUDENT>
<STUDENT>
<ID>4</ID>
<NAME>Mallika</NAME>
<MARKS>660</MARKS>
<BRANCH>IT</BRANCH>
</STUDENT>
<STUDENT>
<ID>5</ID>
<NAME>Zayn</NAME>
<MARKS>560</MARKS>
<BRANCH>IT</BRANCH>
</STUDENT>
</RECORDS>
library("methods")
print(data)
Output:
1 Alia 620 IT
2 Brijesh 440 Commerce
3 Yash 600 Humanities
4 Mallika 660 IT
5 Zayn 560 IT
Reading the CSV file into Data frames in R
CSV stands for Comma-Separated Values. In this file format, the stored values are separated by commas, which
makes storing the data simple.
Why CSV is the most used file format for data storing?
Storing data in an Excel sheet is a common practice in many companies. Many firms also store
data as comma-separated values (CSV), because the process is easier than creating normal
spreadsheets. Later they can use R’s built-in functions to read and analyze the data.
As a widely used language for statistical analysis, R offers specific functions to
read data from a CSV file into organized data frames.
In this short example, we will see how we can read a CSV file into organized data frames.
The first step in this process is getting and setting the working directory. You need to choose the
directory that contains the CSV file.
Here you can check the default working directory using getwd() function and you can also change the directory
using the function setwd().
----> "C:/Users/Dell/Documents"
After the setting of the working path, you need to import the data set or a CSV file as shown below.
> readfile <- read.csv("testdata.txt")
Execute the above line of code in R studio to get the data frame as shown below.
To check the class of the variable ‘readfile’, execute the below code.
> class(readfile)
---> "data.frame"
The resulting data frame includes the student names, their IDs, departments, gender, and marks.
Once you have the data frame, you can analyse the data and extract particular information from it.
> marks <- max(readfile$Marks.Scored) # this gives the highest marks
# To extract the details of the student who scored the highest marks,
> retval <- subset(readfile, Marks.Scored == max(Marks.Scored))
> View(retval)
To extract the details of the students who are studying in the ‘chemistry’ department,
> retval <- subset(readfile, Department == "chemistry") # This extracts the details of the students in the
chemistry department
> View(retval)
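The extraction steps above can be tried on a small stand-in data frame. The sketch below assumes columns named Marks.Scored and Department, as in the calls above; the student records themselves are invented for illustration, and print() is used in place of View() so that it also runs outside RStudio.

```r
# Hypothetical student records standing in for the imported CSV data
readfile <- data.frame(
  Name = c("Asha", "Ravi", "Meena"),
  Department = c("chemistry", "physics", "chemistry"),
  Marks.Scored = c(88, 92, 75)
)

# Highest marks in the data set
marks <- max(readfile$Marks.Scored)
print(marks)

# Row(s) of the student(s) who scored the highest marks
top <- subset(readfile, Marks.Scored == max(Marks.Scored))
print(top)

# All students in the chemistry department
chem <- subset(readfile, Department == "chemistry")
print(chem)
```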
Indexing and Slicing Data Frames in R
In this article let’s discuss indexing and slicing the Data Frames in R Programming Language.
What is Indexing?
The process of accessing particular data components or subsets within a vector, matrix, or data frame is called
indexing. It enables us to pick, remove, or change particular values or parts of our data based on criteria.
Indexing the Data Frame in R
Indexing a data frame in R returns the data of a particular column. It is done by specifying the column name
in square brackets. The syntax for indexing a data frame is:
dataframeName["columnName"]
Example: In this example let’s create a Data Frame “stats” that contains runs scored and wickets taken by a
player and perform indexing on the data frame to extract runs scored by players.
print("stats Dataframe")
stats
stats["runs"]
Output:
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D NA 5
runs
1 100
2 200
3 408
4 NA
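The code that builds the "stats" data frame is not shown above, so the sketch below rebuilds it from the values in the printed output. It also contrasts single-bracket indexing, which returns a one-column data frame, with double-bracket indexing, which returns the column as a plain vector.

```r
# Rebuild the "stats" data frame from the values shown in the output
stats <- data.frame(
  player = c("A", "B", "C", "D"),
  runs = c(100, 200, 408, NA),
  wickets = c(17, 20, NA, 5)
)

# Single brackets with a column name return a one-column data frame
runs_df <- stats["runs"]
print(class(runs_df))

# Double brackets return the column itself as a vector
runs_vec <- stats[["runs"]]
print(runs_vec)
```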
Slicing In R
The process of extracting particular subsets or sections of a vector, matrix, or data frame depending on
predetermined criteria is known as slicing. Using start and end indices, we can choose sequential ranges of
elements or subsets of data.
Slicing the Data Frame
Slicing the Data Frame gives the required rows and columns. This can be done in three ways:
Slicing with [ , ]
Slicing with logical vectors.
Slicing with subset().
Slicing with [ , ]
Slicing the data frame with [ , ] returns the data of the specified rows and columns. The syntax is:
dataframeName[fromRow:toRow, columnNumber]
Example:
In the below code we performed slicing on the data frame to fetch specified rows and columns.
print("stats Dataframe")
stats
Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D NA 5
player runs
2 B 200
3 C 408
players - [1] "A" "B" "C"
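The slicing calls that produce this output are not shown, so the following is a sketch of calls that are consistent with it. The exact row and column indices are inferred from the printed result and should be read as an assumption.

```r
# Rebuild the "stats" data frame from the values shown in the output
stats <- data.frame(
  player = c("A", "B", "C", "D"),
  runs = c(100, 200, 408, NA),
  wickets = c(17, 20, NA, 5)
)

# Rows 2 to 3, columns 1 and 2
slice <- stats[2:3, c(1, 2)]
print(slice)

# Rows 1 to 3 of the first column; a single column comes back as a vector
players <- stats[1:3, 1]
cat("players - ")
print(players)
```

Note that the row names 2 and 3 are kept in the slice, which is why the printed rows are not renumbered from 1.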
Slicing with logical vectors
print("stats Dataframe")
stats
Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 3
4 D 23 5
player runs wickets
2 B 200 20
3 C 408 3
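The output above is consistent with selecting rows through a logical vector. The sketch below shows one way to do that; the specific TRUE/FALSE pattern is inferred from the printed rows and is an assumption.

```r
# "stats" data frame with the values shown in the output
stats <- data.frame(
  player = c("A", "B", "C", "D"),
  runs = c(100, 200, 408, 23),
  wickets = c(17, 20, 3, 5)
)

# TRUE keeps a row, FALSE drops it
keep <- c(FALSE, TRUE, TRUE, FALSE)
by_vector <- stats[keep, ]
print(by_vector)

# A logical vector can also be computed from a condition on a column
by_condition <- stats[stats$runs >= 200, ]
print(by_condition)
```

Building the logical vector from a condition is usually more practical than typing it by hand, since it adapts when the data changes.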
Slicing with subset()
print("stats Dataframe")
stats
Output
"stats Dataframe"
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 3
4 D 23 5
player wickets
1 A 17
2 B 20
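A subset() call can produce the output above by filtering rows and choosing columns in one step. The condition used in the sketch below (wickets >= 17) is an assumption chosen to match the printed rows; the original condition is not shown.

```r
# "stats" data frame with the values shown in the output
stats <- data.frame(
  player = c("A", "B", "C", "D"),
  runs = c(100, 200, 408, 23),
  wickets = c(17, 20, 3, 5)
)

# subset = which rows to keep, select = which columns to keep
picked <- subset(stats, subset = wickets >= 17, select = c(player, wickets))
print(picked)
```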
DataFrame Operations in R
DataFrames are generic data objects of R that store tabular data. They are among the most popular data
objects in R programming because data is convenient to analyze in tabular form. A data frame can also be
thought of as a matrix in which each column can have a different data type. A DataFrame is made up of
three principal components: the data, the rows, and the columns. The operations covered here are:
Creating a DataFrame
Accessing rows and columns
Selecting the subset of the data frame
Editing dataframes
Adding extra rows and columns to the data frame
Add new variables to dataframe based on existing ones
Delete rows and columns in a data frame
Creating a DataFrame
In the real world, a DataFrame is usually created by loading a dataset from existing storage, such as a
SQL database, a CSV file, or an Excel file. A DataFrame can also be created from vectors in R. Some of the
ways to create a DataFrame are described below.
Creating a data frame using Vectors: To create a data frame, call the data.frame() function and pass each
of the vectors you have created as an argument.
Example:
print(df)
Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
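The code that builds df is not shown in this example. A minimal sketch that produces the same table from three vectors, using the values in the output above, could look like this:

```r
# One vector per column
Name <- c("Amiya", "Raj", "Asish")
Language <- c("R", "Python", "Java")
Age <- c(22, 25, 45)

# data.frame() takes the column names from the vector names
df <- data.frame(Name, Language, Age)
print(df)

# str() shows the type of each column
str(df)
```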
Creating a data frame using data from a file: Data frames can also be created by importing the data from a
file. For this, use the read.table() function.
Syntax:
newDF = read.table(file = "Path of the file")
To create a dataframe from a CSV file in R:
Syntax:
newDF = read.csv("FileName.csv")
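To show how the two functions relate, the sketch below writes a small comma-separated file to a temporary location and reads it both ways. The file contents are invented for illustration. With read.table(), the header and sep arguments have to be given explicitly for a CSV file, whereas read.csv() uses them as defaults.

```r
# Create a small comma-separated file (contents are hypothetical)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Language,Age",
             "Amiya,R,22",
             "Raj,Python,25"),
           csv_path)

# read.table() needs header and sep spelled out for a CSV file
viaTable <- read.table(file = csv_path, header = TRUE, sep = ",")

# read.csv() assumes a header line and a comma separator
viaCsv <- read.csv(csv_path)

print(viaTable)
print(isTRUE(all.equal(viaTable, viaCsv)))
```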
Accessing rows and columns
The syntax for accessing rows and columns is given below,
df[val1, val2]
where
df = data frame object
val1 = rows of the data frame
val2 = columns of the data frame
Here, val1 and val2 can each be a vector of values such as 1:2 or 2:3. If you specify only df[val2], with no
comma, it refers to the set of columns that you want to access from the data frame.
Example: Row selection
# Row selection on a data frame
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
print(df)
Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
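The row-selection step itself is missing from the example, so here is a minimal sketch of it. Which rows the original selected is not shown; rows 1 to 2 are used here as an assumption.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# Leave the column position empty to keep all columns
first_two <- df[1:2, ]
print(first_two)

# A single row is selected the same way
third <- df[3, ]
print(third)
```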
Example: Column selection
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
print(df)
Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
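Columns can be selected from this data frame by position or by name. The sketch below shows both forms; the particular columns chosen are an assumption for illustration.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# Leave the row position empty to keep all rows; pick columns 1 to 2
by_position <- df[, 1:2]
print(by_position)

# Columns can also be picked by name
by_name <- df[c("Name", "Age")]
print(by_name)
```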
Selecting the subset of the data frame
Example:
Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
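A subset of this data frame can be taken with subset() and a condition on one or more columns. The condition in the sketch below is hypothetical and only shows how two tests can be combined with the | (or) operator.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# Keep rows where the name is "Amiya" or the age is above 30
newDf <- subset(df, Name == "Amiya" | Age > 30)
print(newDf)
```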
Editing dataframes
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
cat("Before editing the dataframe\n")
print(df)
Output:
Before editing the dataframe
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
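One way to edit a data frame after this "before" state is direct assignment to a cell. The sketch below is an illustration of that approach, with made-up replacement values; it is not necessarily the edit the original example performed.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# Edit a single cell by row number and column name
df[3, "Age"] <- 30

# Edit the cell(s) that match a condition
df$Age[df$Name == "Raj"] <- 26

cat("After editing the dataframe\n")
print(df)
```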
Step 2: Next, use the edit() function to launch the data editor. This step assumes that a data frame named
“myTable” already exists. Note that the result of edit() is assigned back to the “myTable” object, so the
changes made in the editor are saved to the original object.
myTable = edit(myTable)
When the above command is executed, an editor window pops up.
Step 3: Fill in the table with a small roster of entries.
Note that you can change variable names by clicking on their labels and typing your changes. Variables can also
be set as numeric or character. Once the data looks the way you want, close the table. Changes are
saved automatically.
Step 4: Check the resulting data frame by printing it.
> myTable
Adding extra rows and columns to the data frame
Adding extra rows: You can add a row with the rbind() function. Be careful when using rbind(): the data type
of each entry in the new row should match the data type of the corresponding column in the existing rows.
Example:
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
cat("Before adding row\n")
print(df)
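# Append a row with rbind(); the values below are placeholders for illustration
newDf = rbind(df, data.frame(
"Name" = "Sandeep", "Language" = "C", "Age" = 23
))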
cat("After Added a row\n")
print(newDf)
Output:
Before adding row
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
Adding extra columns: You can add a column with the cbind() function.
Example:
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
cat("Before adding column\n")
print(df)
Output:
Before adding column
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
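The step that adds the column is not shown above. A minimal sketch using cbind() follows; the new column name Rank and its values are hypothetical.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# cbind() returns a new data frame with the extra column appended
newDf <- cbind(df, Rank = c(3, 5, 1))

cat("After adding column\n")
print(newDf)
```

The new column needs one value per existing row, so its length should equal nrow(df).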
Add new variables to dataframe based on existing ones
Example:
# Creating a dataframe
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
cat("Original Dataframe\n")
print(df)
Output:
Original Dataframe
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
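A new variable can be derived from an existing column with base R's transform(). The sketch below is one possible approach and swaps in a made-up variable, AgeNextYear; the original example's derived variable is not shown.

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# transform() evaluates the expression using the columns of df
newDf <- transform(df, AgeNextYear = Age + 1)

cat("Dataframe with the new variable\n")
print(newDf)
```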
Delete rows and columns in a data frame
Example:
df = data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "Python", "Java"),
"Age" = c(22, 25, 45)
)
cat("Before deleting the 3rd row and 2nd column\n")
print(df)
Output:
Before deleting the 3rd row and 2nd column
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
After Deleted the 3rd row and 2nd column
Name Age
1 Amiya 22
2 Raj 25
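The deletion step that leads to this output is not shown. Negative indices are one way to get the same result, as in this sketch:

```r
# Same data frame as in the example
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# A negative index drops that row or column: here row 3 and column 2
newDF <- df[-3, -2]

cat("After Deleted the 3rd row and 2nd column\n")
print(newDF)
```

The original df is left unchanged; the result only replaces it if you assign it back to df.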