Data Science Unit-4
UNIT – IV
19.10.2022
INTRODUCTION TO ‘R’ PROGRAMMING
GETTING STARTED WITH ‘R’
INSTALLATION OF ‘R’ SOFTWARE AND USING THE INTERFACE
VARIABLES AND DATA TYPES
‘R’ OBJECTS
VECTORS & LISTS
OPERATIONS: ARITHMETIC, LOGICAL & MATRIX OPERATIONS
DATA FRAMES
FUNCTIONS
CONTROL STRUCTURES
DEBUGGING & SIMULATION IN ‘R’
R & R Studio
How to set the working directory
How to create an R file and save it
How to execute an R file
How to execute pieces of code
This is how the R Studio interface looks.
When you first run the application, the Console panel is on the left; here you can type commands and see the results they produce.
The Files tab shows the files and directories that are available in the default workspace of R. The Plots tab shows the plots that are generated during the course of programming.
The Packages tab lists the packages that are already installed in R Studio and also gives a user interface to install new packages.
The Help tab is one of the most important: it gives access to the R documentation on the functions that are built into R.
The last tab is the Viewer tab, which can be used to see local web content generated using R or some other application.
The working directory in R Studio can be set in 2 ways. The first way is to use the console and the command setwd() (set working directory).
Pass to setwd() the path of the directory which you want to be the working directory for R Studio, in double quotes.
Or, to set the working directory from the GUI, click on the 3 dots button.
This opens a file browser, which helps you choose your working directory.
Once you have chosen your working directory, click the settings button in the More tab; in the popup menu that appears, select Set as working directory.
This makes the directory which you chose in the file browser your working directory.
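A minimal sketch of the console route. The directory used here is an assumption: tempdir() stands in for whatever path you would type in double quotes, so that the example runs on any machine.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# Assumption: tempdir() stands in for a real path such as "C:/Users/me/project"
setwd(tempdir())
print(getwd())   # shows the new working directory

# Restore the original working directory
setwd(old_dir)
```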
Once you set the working directory, you are ready to program in R Studio.
Let us illustrate how to create an R file and write some code. There are 2 ways to create an R file. The first way is to click on the File tab; this gives a drop-down menu, where you can select New File and then R Script, so that a new file opens.
The other way is to use the + button, just below the File tab, and choose R Script from there to open a new R script file.
Once you open an R script file, this is how R Studio looks with the script file open.
So, 3 panels are there: the Console panel, the Environment/History panel, and the Files/Plots panel. On top of that, you have a new window, which is now open as a script file. Now you are ready to write a script file or some program in R Studio.
So, let us illustrate this with a small example. The first line of the code assigns the value 11 to a. The second command evaluates a times 10 and assigns the result to b. The third statement, print(c(a, b)), concatenates a and b and prints the result.
So, this is how you write a script file in R.
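The three-line script described above can be written as follows. This is a reconstruction from the description in these notes, using the variable names a and b given there.

```r
a = 11           # assign the value 11 to a
b = a * 10       # evaluate a times 10 and assign the result to b
print(c(a, b))   # concatenate a and b and print: [1]  11 110
```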
Once you write a script file, you have to save this file before you execute it.
Let us see how to save the R file. From the File menu, you can click the Save button, which saves the file as Untitled x, where x is 1 or 2 and so on, depending upon how many R scripts you have already opened. It is a better idea to use the Save As button, just below Save, so that you can name the script file as you wish.
Let us suppose we have clicked the Save As button.
This pops up a window like this, where you can name the script file test.R, or whatever name you intend. Once you have named it, click Save; that saves the script file.
So now, we have seen how to open an R script and how to write some code in the R
script file. The next task is to execute the R file.
There are several ways to execute the commands in the R file.
The first way is to use the Run command.
The Run command can be executed from the GUI by pressing the Run button, or you can use the shortcut key Control + Enter. It executes the line on which the cursor is placed.
The other way is to run the R code using Source or Source with Echo.
The difference between Source and Source with Echo is the following:
The Source command executes the whole R file and prints only the output which you asked to print.
Whereas Source with Echo prints the commands as well, along with the output you are printing.
So, this is an example where I have executed the R file using Source with Echo. You can see in the console that it printed the command a = 11, the command b = a * 10, and also the output of print(c(a, b)) with the values. Since a = 11 and b = 11 times 10, which is 110, this is how the output is printed in the console.
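The same two behaviours are available from the console through the source() function. The sketch below is an illustration under assumptions: it writes the example script to a temporary file (the file name test.R and its location in tempdir() are choices made here), then runs it without and with echo.

```r
# Write the example script to a temporary file (the location is an assumption)
script_path <- file.path(tempdir(), "test.R")
writeLines(c("a = 11", "b = a * 10", "print(c(a, b))"), script_path)

# Source: runs the whole file and prints only what the script prints
source(script_path)

# Source with Echo: prints each command as well as the output
source(script_path, echo = TRUE)
```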
Now, let us see how to execute pieces of code in R.
So now, let us try to assign value 14 for a, and then try to run it.
So, how do you do this?
Take your cursor to the line which you want to edit, replace 11 with 14, and then use Control + Enter or the Run button.
This will execute only the line, where the cursor is placed.
In summary, Run can be used to execute the selected lines of R code.
28.10.2022
How to
o add comments
o clear the environment
o save the workspace
To add a comment on a single line in an R script, use the hash key (#) at the start of the comment.
Commenting makes the script file more readable.
To make a line of code inert, insert ‘#’ at the start of the line.
There are 2 ways to comment multiple lines. One is to select the lines which you want to comment, using the cursor, and then use the key combination Control + Shift + C to comment or uncomment the selected lines.
The console can be cleared using the shortcut key Control + L.
To clear variables from the R environment, use the rm() command.
To clear a single variable from the R environment, use rm() followed by the variable to be removed.
The environment is empty now.
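A short sketch of rm() in use. The variable names a, b and x are chosen here only for illustration.

```r
a <- 11
b <- a * 10
x <- "temporary"

rm(a)                      # clear the single variable a
print(exists("a", inherits = FALSE))   # FALSE: a has been removed

rm(list = c("b", "x"))     # clear several variables by name
# rm(list = ls()) would clear every variable in the environment at once
```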
Workspace data
It is sometimes necessary to save the data that is already there in the current session.
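One way to do this from the console is sketched below, assuming the data should go to a file named session.RData in a temporary directory (both the name and the location are assumptions). save() with list = ls() writes the current variables to the file and load() restores them; save.image() is the base R shortcut that saves the whole global workspace.

```r
a <- 11
b <- a * 10

# Assumption: the file name and location are chosen for this example
workspace_file <- file.path(tempdir(), "session.RData")

save(list = ls(), file = workspace_file)   # save the current variables
rm(a, b)                                   # remove them from the session
load(workspace_file)                       # restore them from the file
print(c(a, b))
```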
Variables and Data Types in R
Variables
Basic Data Types
R Objects
o Vectors
o Lists
1. A variable name in R must consist of alphanumeric characters; the only special characters that can be used in variable names are the underscore and the period.
2. A variable name must start with a letter (or with a period that is not followed by a digit), and no special characters other than the underscore and period are allowed in variable names.
3. This shows some examples of correct variable names that can be used in R.
The first one, b2 = 7, assigns the value 7 to the variable b2. This is a valid variable name because it starts with a letter and has only alphanumeric characters.
Similarly, the second one, Manoj_GDPL = "scientist", is also a valid variable name: it has a special character, but that character is the underscore, which is allowed in variable names.
An example where the variable name is not correct: 2b = 7 gives an error, because the variable name starts with a numeric character, which does not follow the rules for variable names in R.
R also contains some predefined constants, such as pi; letters, the lowercase letters a to z; LETTERS, the uppercase letters A to Z; and the months of the year: month.name gives the full month names and month.abb gives the abbreviated month names.
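The naming rules and the predefined constants can be tried out as follows. The variable names are the ones used in these notes; the invalid name is left as a comment because it would stop the script with a syntax error.

```r
b2 = 7                     # valid: starts with a letter
Manoj_GDPL = "scientist"   # valid: underscore is an allowed special character
# 2b = 7                   # invalid: a name cannot start with a digit

print(pi)                  # 3.141593
print(letters[1:3])        # "a" "b" "c"
print(LETTERS[1:3])        # "A" "B" "C"
print(month.name[1:2])     # "January" "February"
print(month.abb[1:2])      # "Jan" "Feb"
```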
The data types available in R:
R has the following basic data types; the table shows each data type and the values it can take.
R has a logical data type, which takes a value of either TRUE or FALSE.
It supports an integer data type, which is the set of all integers, and numeric, which is the set of all real numbers.
R also supports complex, the set of all complex numbers.
There is also a character data type, which covers all the letters and special characters.
There are several tasks that can be done using data types.
The following table gives the task, the action and the syntax for doing the task.
For example, the first task is to find the data type of an object.
To do this, use typeof(): pass the object as an argument to typeof() and it returns the data type of the object.
Warning message:
NAs introduced by coercion
The third task is to convert the data type of one object to another, using "as." followed by the data type as the command:
as.data_type(object) – the syntax is as., then the data type to convert to, applied to the object which is to be coerced.
Note that not all coercions are possible; if an impossible coercion is attempted, NA is returned along with a warning.
A numeric variable can be coerced into a complex variable by using as.complex().
For example, as.complex(2) converts the numeric value 2 into the complex value 2+0i.
Coercing a character into a numeric variable using the command as.numeric() gave us not available, or NA.
This means that a character string which does not represent a number cannot be coerced into a numeric value.
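The tasks above can be sketched as follows. The sample values are chosen here for illustration; suppressWarnings() is used only to keep the output short, and without it R also prints the warning "NAs introduced by coercion".

```r
# Task: find the data type of an object
print(typeof(TRUE))    # "logical"
print(typeof(2L))      # "integer"
print(typeof(2.5))     # "double", which is R's numeric type
print(typeof(2+3i))    # "complex"
print(typeof("a"))     # "character"

# Task: convert one data type to another with as.data_type(object)
print(as.complex(2))   # 2+0i

# A string that is not a number cannot be coerced: the result is NA
num <- suppressWarnings(as.numeric("abc"))
print(num)             # NA

# A string that does represent a number is coerced without trouble
print(as.numeric("3.5"))   # 3.5
```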
Among the basic objects of R, the most important are vectors, lists and data frames.
A vector is an ordered collection of elements of the same data type, a list is an ordered collection of objects, and a data frame is a generic tabular object, which is very important and is one of the most widely used objects in the R programming language.
Basic Objects
Object        Values
Vector        Ordered collection of same data types
List          Ordered collection of objects
Data Frame    Generic tabular object
Vectors
Define a vector containing four numeric values and assign it to a variable X.
The code is X = c(2.3, 4.5, 6.7, 8.9), that is, the concatenation of these numbers, followed by print(X).
When this piece of code is executed, it creates a vector X with the values 2.3, 4.5, 6.7, 8.9 and prints them in the console.
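The vector described above can be created like this; the values are the ones listed in these notes.

```r
X = c(2.3, 4.5, 6.7, 8.9)   # concatenate four numeric values into a vector
print(X)                    # [1] 2.3 4.5 6.7 8.9
```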
Lists
A list is a generic object consisting of an ordered collection of objects.
A list can be a list of vectors, a list of matrices, a list of characters, a list of functions and so on.
Example – To illustrate how a list looks:
o To build a list of employees with their details, we need attributes such as ID, employee name and number of employees.
o So, create a vector for each of those attributes.
The first attribute, the employee IDs, is a numeric vector.
The second attribute, the employee names, is a character vector.
The third attribute, the number of employees, is a single numeric value.
Combine these three different objects into a list containing the details of the employees, using the list() command.
The list() command creates the variable emp.list, which is a list of the ID, emp.name and num.emp defined above.
Once the list is created, print it and see how the output looks.
When this code is executed, the list is printed in the console:
The first element is the IDs 1, 2, 3, 4.
The second element of the list contains the names of the employees.
The third element of the list says how many employees there are.
The list is created.
All the components of a list can be named, and those names can be used to access the components of the list.
For example, take the same list created earlier, with the same ID, emp.name and num.emp.
Instead of creating the list directly, names can be given to the attributes: id, names (the names of the employees) and Total Staff, as shown in the code here.
Once this code is executed, the list is created. To access an element of the list by name, use the dollar operator ($): emp.list is the list, and emp.list$names accesses the component with the name names.
Printing the result of this command prints the names of the employees.
ID = c(1,2,3,4)
emp.name=c("man","rag","sha","din")
num.emp=4
emp.list=list(ID,emp.name,num.emp)
emp.list=list("id"=ID, "names"=emp.name, "Total Staff"=num.emp)
print(emp.list$names)
Output
[1] "man" "rag" "sha" "din"
The components of a list can also be accessed using indices.
To access the top level components of a list, use the double slicing operator, which is two square brackets: [[ ]].
To access the lower or inner level components of a list, use another square bracket along with the double slicing operator.
The code here illustrates how to access the top level components.
For example, to access the IDs, use print(emp.list[[1]]); the double slicing operator gives the first top level component, which is ID.
The second component can be accessed similarly with emp.list[[2]].
To access the first sub element (inner component) of the component ID, use emp.list[[1]][1]: the double slicing operator followed by the element index in another square bracket.
Similarly, emp.list[[2]][1] accesses the first employee name and prints the value man from the employee list.
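A sketch of index-based access, rebuilding the employee list from the code given earlier in these notes.

```r
ID = c(1, 2, 3, 4)
emp.name = c("man", "rag", "sha", "din")
num.emp = 4
emp.list = list("id" = ID, "names" = emp.name, "Total Staff" = num.emp)

print(emp.list[[1]])      # top level component 1: [1] 1 2 3 4
print(emp.list[[2]])      # top level component 2: the employee names
print(emp.list[[1]][1])   # first sub element of component 1: [1] 1
print(emp.list[[2]][1])   # first employee name: [1] "man"
```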
A list can also be modified by accessing its components and replacing them with the required values.
For example, to change the total number of staff to 5, assign the value 5 to the Total Staff component.
To add a new employee name, note that the component of the list which holds the employee names is component 2.
The new name Nir has to be added to that component as a new sub component.
So, directly assign the character value Nir to the fifth sub component of the second component of the list.
Now, to give this new employee an ID, which is 5, access the fifth sub element of the first component and assign it the value 5.
Print the list: now there are 5 IDs, the total staff is 5, and the name Nir has been added to the list.
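The modifications described above can be sketched as follows. The list is rebuilt first so that the example runs on its own; the exact statements used in the original lecture are not reproduced in these notes, so the assignment forms below are one reasonable way to write them.

```r
ID = c(1, 2, 3, 4)
emp.name = c("man", "rag", "sha", "din")
num.emp = 4
emp.list = list("id" = ID, "names" = emp.name, "Total Staff" = num.emp)

emp.list[["Total Staff"]] = 5   # change the total number of staff to 5
emp.list[[2]][5] = "Nir"        # add a fifth name to the second component
emp.list[[1]][5] = 5            # add the ID 5 to the first component
print(emp.list)
```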
How to concatenate the list?
Two lists can be concatenated using the concatenation function c().
The syntax is c(list1, list2).
The first list already contains three attributes; we want to add another attribute, emp.ages.
For this, create a new list which contains the ages of the five employees.
To concatenate this new list, emp.ages, with the original list, emp.list, use the concatenation function.
This command concatenates the two lists and assigns the result back to emp.list.
On printing emp.list, another attribute, ages, has been added to the original list.
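A sketch of the concatenation. The ages below are made-up values for illustration, since the notes do not list the actual ages.

```r
emp.list = list("id" = c(1, 2, 3, 4, 5),
                "names" = c("man", "rag", "sha", "din", "Nir"),
                "Total Staff" = 5)

# Hypothetical ages, one per employee (not taken from the notes)
emp.ages = list("ages" = c(23, 48, 54, 30, 32))

emp.list = c(emp.list, emp.ages)   # concatenate the two lists
print(emp.list)                    # now has a fourth component, ages
```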
01.11.2022
Data Frames
Data frame
Create
Access rows and columns
Edit
Add new rows and columns
vec1=c(1,2,3)
vec2=c("r","scilab","java")
vec3=c("for prototyping", "for prototyping", "for scale up")
df=data.frame(vec1,vec2,vec3)
print(df)
Data frames are generic data objects of R which are used to store tabular data.
Data frames are the most popular data objects in R programming because we are comfortable seeing data in tabular form.
Data frames can also be thought of as matrices in which each column can be of a different data type.
To create a data frame, use the command data.frame() and pass each column as an argument to this function.
Data frames can also be created by importing data from a text file, using the function read.table().
Syntax:
new_data_frame = read.table(file = "path of the file", sep = "separator")
The separator can also be a comma or a tab etc.
Data frames can be created either on the go or using the data that already exists in
some format.
Thus a data frame was created
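A minimal sketch of importing with read.table(). The file tools.csv is created in a temporary directory only so that the example is self-contained; the file name, its contents and header = TRUE are assumptions made for this illustration.

```r
# Create a small comma-separated text file to read back (assumed sample data)
csv_path <- file.path(tempdir(), "tools.csv")
writeLines(c("vec1,vec2", "1,r", "2,scilab", "3,java"), csv_path)

# Import the file; sep gives the separator, header = TRUE uses line 1 as names
new_data_frame = read.table(file = csv_path, sep = ",", header = TRUE)
print(new_data_frame)
```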
RUN – 1
vec1=c(1,2,3)
vec2=c("r","scilab","java")
vec3=c("for prototyping", "for prototyping", "for scale up")
df=data.frame(vec1,vec2,vec3)
print(df[1:2,])
Output
vec1 vec2 vec3
1 1 r for prototyping
2 2 scilab for prototyping
RUN – 2
vec1=c(1,2,3)
vec2=c("r","scilab","java")
vec3=c("for prototyping", "for prototyping", "for scale up")
df=data.frame(vec1,vec2,vec3)
print(df[,1:2])
Output
vec1 vec2
1 1 r
2 2 scilab
3 3 java
RUN – 3
vec1=c(1,2,3)
vec2=c("r","scilab","java")
vec3=c("for prototyping", "for prototyping", "for scale up")
df=data.frame(vec1,vec2,vec3)
print(df[1:2])
Output
vec1 vec2
1 1 r
2 2 scilab
3 3 java
Subset
Output
Name month BS BP
1 Senthil Jan 141.2 90
2 Senthil Feb 139.3 78
3 Sam Jan 135.2 80
4 Sam Feb 160.1 81
[1] "new subset pd2"
Name month BS BP
1 Senthil Jan 141.2 90
2 Senthil Feb 139.3 78
4 Sam Feb 160.1 81
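The code that produced this output is not included in these notes. The sketch below uses subset() with a condition that is an assumption: it was chosen because it keeps rows 1, 2 and 4, matching the output shown, but the original condition may have been different.

```r
pd = data.frame("Name" = c("Senthil", "Senthil", "Sam", "Sam"),
                "month" = c("Jan", "Feb", "Jan", "Feb"),
                "BS" = c(141.2, 139.3, 135.2, 160.1),
                "BP" = c(90, 78, 80, 81))
print(pd)

# Assumed condition: keep Senthil's rows, or any row with BS above 150
pd2 = subset(pd, Name == "Senthil" | BS > 150)
print("new subset pd2")
print(pd2)
```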
Editing Data Frames
vec1=c(1,2,3)
vec2=c("R","Scilab","Java")
vec3=c("For prototyping","For prototyping","For Scale up")
df=data.frame(vec1,vec2,vec3)
print(df)
df[[2]][2]="R"
print(df)
Output
vec1 vec2 vec3
1 1 R For prototyping
2 2 Scilab For prototyping
3 3 Java For Scale up
vec1 vec2 vec3
1 1 R For prototyping
2 2 R For prototyping
3 3 Java For Scale up
To use the edit command, first create an instance of a data frame using the data.frame() command.
This creates an empty data frame; the edit() command is then used to edit the entries in the data frame.
To add an extra row, use the command rbind(), and to add an extra column, use the command cbind().
Note: when adding a row, the data type of each entry should match the data type of the corresponding existing column.
vec1=c(1,2,3)
vec2=c("R","Scilab","Java")
vec3=c("For prototyping","For prototyping","For Scale up")
df=data.frame(vec1,vec2,vec3)
df=rbind(df,data.frame(vec1=4,vec2="C",vec3="For Scale up"))
print("Adding Extra Row")
print(df)
df=cbind(df,vec4=c(10,20,30,40))
print("Adding Extra Column")
print(df)
Output
[1] "Adding Extra Row"
vec1 vec2 vec3
1 1 R For prototyping
2 2 Scilab For prototyping
3 3 Java For Scale up
4 4 C For Scale up
[1] "Adding Extra Column"
vec1 vec2 vec3 vec4
1 1 R For prototyping 10
2 2 Scilab For prototyping 20
3 3 Java For Scale up 30
4 4 C For Scale up 40
Deleting Rows and Columns
Access the row (or column) first and insert a negative (-) sign before its index.
For example:
To delete the 3rd row, select it and insert a negative sign, as in df[-3,].
For conditional deletion, the exclamation mark (!) negates a condition: !names(df) %in% c("vec3") says no to the columns that have the column name vec3.
vec1=c(1,2,3)
vec2=c("R","Scilab","Java")
vec3=c("For prototyping","For prototyping","For Scale up")
df=data.frame(vec1,vec2,vec3)
df=rbind(df,data.frame(vec1=4,vec2="C",vec3="For Scale up"))
#print("Adding Extra Row")
#print(df)
df=cbind(df,vec4=c(10,20,30,40))
#print("Adding Extra Column")
#print(df)
df2=df[-3,-1]
print(df2)
Output
vec2 vec3 vec4
1 R For prototyping 10
2 Scilab For prototyping 20
4 C For Scale up 40
vec1=c(1,2,3)
vec2=c("R","Scilab","Java")
vec3=c("For prototyping","For prototyping","For Scale up")
df=data.frame(vec1,vec2,vec3)
df=rbind(df,data.frame(vec1=4,vec2="C",vec3="For Scale up"))
#print("Adding Extra Row")
#print(df)
df=cbind(df,vec4=c(10,20,30,40))
#print("Adding Extra Column")
#print(df)
#df2=df[-3,-1]
#print(df2)
#conditional deletion
df3=df[,!names(df)%in%c("vec3")]
print(df3)
df4=df[!df$vec1==3,]
print(df4)
Output
vec1 vec2 vec4
1 1 R 10
2 2 Scilab 20
3 3 Java 30
4 4 C 40
vec1 vec2 vec3 vec4
1 1 R For prototyping 10
2 2 Scilab For prototyping 20
4 4 C For Scale up 40
05.11.2022
Data Entry with R’s Text Editor
https://youtu.be/sAit4ctcX2Q
vec1=c(1,2,3)
vec2=c("R","Scilab","Java")
vec3=c("For prototyping","For prototyping","For Scale up")
df=data.frame(vec1,vec2,vec3)
print(df)
df[[2]][2]="R"
print(df)
df=rbind(df,data.frame(vec1=4,vec2="C",vec3="For Scale up"))
df=cbind(df,vec4=c(10,20,30,40))
print(df)
df2=df[-3,-2]
print(df2)
df3=df[,!names(df) %in% c("vec3")]
print(df3)
df4=df[!df$vec1==3,]
print(df4)
df[3,1]=3.1
df[3,3]="others"
print(df)
Output
vec1 vec2 vec3
1 1 R For prototyping
2 2 Scilab For prototyping
3 3 Java For Scale up
vec1 vec2 vec3
1 1 R For prototyping
2 2 R For prototyping
3 3 Java For Scale up
vec1 vec2 vec3 vec4
1 1 R For prototyping 10
2 2 R For prototyping 20
3 3 Java For Scale up 30
4 4 C For Scale up 40
vec1 vec3 vec4
1 1 For prototyping 10
2 2 For prototyping 20
4 4 For Scale up 40
vec1 vec2 vec4
1 1 R 10
2 2 R 20
3 3 Java 30
4 4 C 40
vec1 vec2 vec3 vec4
1 1 R For prototyping 10
2 2 R For prototyping 20
4 4 C For Scale up 40
vec1 vec2 vec3 vec4
1 1.0 R For prototyping 10
2 2.0 R For prototyping 20
3 3.1 Java others 30
4 4.0 C For Scale up 40
Factor issue: when character data is stored as factors, R treats the values we enter as categories (factor levels) and assumes that these are the only levels available for now.
In that case, changing the 3rd row, 3rd column to "others" produces a warning message saying that the category "others" is not an available level, and the entry is replaced with NA (note the use of the word factor in the warning message).
New entries must be consistent with the factor levels that are already defined; if they are not, the warning message is printed.
To avoid this, pass another argument while defining the data frame: stringsAsFactors = FALSE. In R versions before 4.0.0 this argument was TRUE by default, which is the reason for the warning message; from R 4.0.0 onwards the default is FALSE.
If the same operations are carried out now, there will not be any warning messages.
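The difference can be seen by building the same data frame both ways. The names df_factor and df_char are chosen here for illustration; stringsAsFactors is set explicitly in both cases so that the result does not depend on the R version. suppressWarnings() hides the "invalid factor level" warning that the first assignment would otherwise print.

```r
vec1 = c(1, 2, 3)
vec3 = c("For prototyping", "For prototyping", "For Scale up")

# Strings stored as factors: "others" is not an existing level
df_factor = data.frame(vec1, vec3, stringsAsFactors = TRUE)
suppressWarnings(df_factor[3, "vec3"] <- "others")
print(df_factor)   # row 3 of vec3 is NA

# Strings stored as plain characters: any new value is accepted
df_char = data.frame(vec1, vec3, stringsAsFactors = FALSE)
df_char[3, "vec3"] = "others"
print(df_char)     # row 3 of vec3 is "others"
```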
Recasting
Need to recast data frames
Recast in 2 steps: Melt, Cast
Recast in 1 step – recast
Joining of 2 data frames – left join, right join, inner join
Output
Name month BS BP
1 Senthil Jan 141.2 90
2 Senthil Feb 139.3 78
3 Sam Jan 135.2 80
4 Sam Feb 160.1 81
To recast the data frame into another form, 2 steps are used: the first is melt and the second is cast.
To use the melt and cast commands, the identifier variables and the measurement variables of the data frame have to be identified.
Most discrete variables can be identifier variables, and numeric variables can be measurement variables. There are certain rules for measurement variables: for example, categorical and date variables cannot be measurement variables.
This melt command is available in the reshape2 library.
Output
Name month variable value
1 Senthil Jan BS 141.2
2 Senthil Feb BS 139.3
3 Sam Jan BS 135.2
4 Sam Feb BS 160.1
5 Senthil Jan BP 90.0
6 Senthil Feb BP 78.0
7 Sam Jan BP 80.0
8 Sam Feb BP 81.0
For the melt command, give the data frame as the first argument; then specify the identifier variables (id.vars) and the measurement variables (measure.vars) in the data frame.
The melt command was used to melt the data frame into this structure.
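A sketch of the melt step, assuming the reshape2 package is installed. The data frame pd is the one used elsewhere in these notes; the name Df for the melted result is chosen here for illustration.

```r
library(reshape2)

pd = data.frame("Name" = c("Senthil", "Senthil", "Sam", "Sam"),
                "month" = c("Jan", "Feb", "Jan", "Feb"),
                "BS" = c(141.2, 139.3, 135.2, 160.1),
                "BP" = c(90, 78, 80, 81))

# Name and month identify a record; BS and BP are the measurements
Df = melt(pd, id.vars = c("Name", "month"), measure.vars = c("BS", "BP"))
print(Df)   # 8 rows with columns Name, month, variable, value
```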
The dcast() function is also available in the reshape2 library.
In this step, the column from which the values are to be taken has to be specified.
Using dcast(), another data frame df2 is created, in which variable and month are kept as they are, with blood sugar (BS) and blood pressure (BP) as the variables of interest.
The Name variable is converted into columns (the number of columns depends on the number of categories in Name).
The columns variable and month remain as they are, and the categories in Name become new variables.
There are 2 categories in the example, Sam and Senthil, and they become the new columns; the values for those columns are picked from the column value.
Output
variable month Sam Senthil
1 BS Feb 160.1 139.3
2 BS Jan 135.2 141.2
3 BP Feb 81.0 78.0
4 BP Jan 80.0 90.0
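A sketch of the cast step that leads to a table of this shape, again assuming reshape2 is installed. The melted data frame is rebuilt first so that the example runs on its own; the name Df is chosen here for illustration.

```r
library(reshape2)

pd = data.frame("Name" = c("Senthil", "Senthil", "Sam", "Sam"),
                "month" = c("Jan", "Feb", "Jan", "Feb"),
                "BS" = c(141.2, 139.3, 135.2, 160.1),
                "BP" = c(90, 78, 80, 81))
Df = melt(pd, id.vars = c("Name", "month"), measure.vars = c("BS", "BP"))

# Keep variable and month as rows, spread Name into columns,
# and fill the cells from the column named value
df2 = dcast(Df, variable + month ~ Name, value.var = "value")
print(df2)
```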
Recasting in a single step
Recasting can be performed in a single step, using the recast function.
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
library(reshape2)
recast(pd,variable+month~Name, id.var=c("Name","month"))
Output
variable month Sam Senthil
1 BS Feb 160.1 139.3
2 BS Jan 135.2 141.2
3 BP Feb 81.0 78.0
4 BP Jan 80.0 90.0
Melt and Cast operations can be done together using the recast command.
09.11.2022
To create a new variable, mutate() is used: load the dplyr library, call mutate(), and pass the data frame and the definition of the new variable as arguments.
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
library(dplyr)
pd2 <- mutate(pd,log_BP=log(BP))
print(pd2)
How to join 2 data frames?
Joining is very important in data analysis, because part of the data may come from one source and part from another; to match these data, some common IDs are used.
The Id variable is common to both data frames, which means that the variable is available in both of the data frames to be combined.
This variable provides the identifiers for combining the 2 data frames.
The nature of the combination depends upon the function that is used.
Example: Create 2 data frames
1) pd
2) pd_new
Data frame 1: pd
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
print(pd)
Name month BS BP
1 Senthil Jan 141.2 90
2 Senthil Feb 139.3 78
3 Sam Jan 135.2 80
4 Sam Feb 160.1 81
The variable department will be added to the final data frame, only for Senthil and Sam.
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
#print(pd)
pd_new=data.frame("Name"=c("Senthil","Ramesh","Sam"),
"dept"=c("PSE","Data Analytics","PSE"))
#print(pd_new)
library(dplyr)
# new dataframe name
pd_left_join1<-left_join(pd,pd_new,by="Name")
print(pd_left_join1)
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
#print(pd)
pd_new=data.frame("Name"=c("Senthil","Ramesh","Sam"),
"dept"=c("PSE","Data Analytics","PSE"))
#print(pd_new)
library(dplyr)
# new dataframe name
pd_right_join1<-right_join(pd,pd_new,by="Name")
print(pd_right_join1)
right join – joins the matching rows of data frame 1 to data frame 2 based on the Id variable.
The 2 data frames are passed as arguments and the 2nd data frame is the reference; where there are no matching values, the entries are filled with NAs.
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"month"=c("Jan","Feb","Jan","Feb"),
"BS"=c(141.2,139.3,135.2,160.1),
"BP"=c(90,78,80,81))
#print(pd)
pd_new=data.frame("Name"=c("Senthil","Ramesh","Sam"),
"dept"=c("PSE","Data Analytics","PSE"))
#print(pd_new)
library(dplyr)
# new dataframe name
pd_right_join1<-right_join(pd_new,pd,by="Name")
print(pd_right_join1)
The order can be changed while passing the arguments. In the above example, the pd data frame is the reference, so the output is similar to that of a left join.
Name month BS BP dept
1 Senthil Jan 141.2 90 PSE
2 Senthil Feb 139.3 78 PSE
3 Sam Jan 135.2 80 PSE
4 Sam Feb 160.1 81 PSE
Inner join merges the two data frames and retains only those rows whose ids are present in both data frames.
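A sketch of an inner join on the two data frames used above, assuming dplyr is installed. The result name pd_inner_join1 is chosen here to follow the naming of the earlier examples.

```r
library(dplyr)

pd = data.frame("Name" = c("Senthil", "Senthil", "Sam", "Sam"),
                "month" = c("Jan", "Feb", "Jan", "Feb"),
                "BS" = c(141.2, 139.3, 135.2, 160.1),
                "BP" = c(90, 78, 80, 81))
pd_new = data.frame("Name" = c("Senthil", "Ramesh", "Sam"),
                    "dept" = c("PSE", "Data Analytics", "PSE"))

# Only names present in both data frames are kept, so Ramesh is dropped
pd_inner_join1 <- inner_join(pd, pd_new, by = "Name")
print(pd_inner_join1)
```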
11.11.2022
Arithmetic, Logical and Matrix operations in R
Arithmetic Operations in R
R supports all the basic arithmetic operations. The first operator to know is the assignment operator.
Either = or the back arrow <- can be used to assign a value to a variable. The standard addition, subtraction, multiplication, division, integer division and remainder operations are also available in R.
Both = and the back arrow <- are valid assignment operators in R, and therefore in R Studio as well; <- is the conventional choice in R code.
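The operators listed above can be tried out as follows. The values 7 and 2 are chosen here only for illustration.

```r
a <- 7          # assignment with the back arrow
b = 2           # assignment with =

print(a + b)    # addition: 9
print(a - b)    # subtraction: 5
print(a * b)    # multiplication: 14
print(a / b)    # division: 3.5
print(a %/% b)  # integer division: 3
print(a %% b)   # remainder: 1
print(a ^ b)    # exponent: 49
```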
Hierarchy of operations
The hierarchy of operations while performing arithmetic in R is similar to the normal BODMAS rule: brackets have the first priority, exponents have the second priority, followed by division, multiplication, addition and subtraction.
Value of A in the given example is 5
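The expression from the original example is not reproduced in these notes. The expression below is a hypothetical one, chosen because it also evaluates to 5, to show the order in which R applies the operators.

```r
# Hypothetical expression (not necessarily the one from the lecture)
A = 7 - 2 * 27 / 3^2 + 4
# Step 1: exponent          3^2    = 9
# Step 2: multiplication    2 * 27 = 54
# Step 3: division          54 / 9 = 6
# Step 4: left to right     7 - 6 + 4 = 5
print(A)   # [1] 5
```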
Logical Operations in R
An important class of operations needed for data analysis problems is matrix operations.
Most of the data will be treated as matrices.
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, rows are the ones which run horizontally and columns are the ones which run vertically. These are examples of matrices.
A=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3,byrow=TRUE)
print(A)
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
A=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3)
print(A)
Output
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
matrix(3,3,4)
Output
[,1] [,2] [,3] [,4]
[1,] 3 3 3 3
[2,] 3 3 3 3
[3,] 3 3 3 3
diag(c(4,5,6),3,3)
Output
[,1] [,2] [,3]
[1,] 4 0 0
[2,] 0 5 0
[3,] 0 0 6
diag(1,3,3)
Output
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
Matrix Metrics
To know the dimensions of the matrix, that is, how many rows, how many columns and how many elements there are in the matrix, the following commands can be used.
A=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3,byrow=TRUE)
dim(A)
nrow(A)
ncol(A)
length(A)
Output
[1] 3 3
[1] 3
[1] 3
[1] 9
dim(A) returns the size of the matrix as the number of rows and the number of columns.
nrow(A) returns the number of rows and ncol(A) returns the number of columns.
Either length(A) or the product of the dimensions, prod(dim(A)), returns the number of elements in the matrix.
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
print(A)
Output
  a b c
d 1 2 3
e 4 5 6
f 8 9 1
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[,1:2]
Output
  a b
d 1 2
e 4 5
f 8 9
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[,c("a","c")]
Output
  a c
d 1 3
e 4 6
f 8 1
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[c("d","f"),]
Output
  a b c
d 1 2 3
f 8 9 1
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[ ]
A[1,2]
A[2,3]
Output
  a b c
d 1 2 3
e 4 5 6
f 7 8 9
[1] 2
[1] 6
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[]
A[,1]
Output
  a b c
d 1 2 3
e 4 5 6
f 7 8 9
d e f
1 4 7
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[]
A[2,]
Output
  a b c
d 1 2 3
e 4 5 6
f 7 8 9
a b c
4 5 6
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[]
A[,-2]
Output
  a b c
d 1 2 3
e 4 5 6
f 7 8 9
  a c
d 1 3
e 4 6
f 7 9
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
colnames(A)<-c("a","b","c")
rownames(A)<-c("d","e","f")
A[]
A[-2,]
Output
  a b c
d 1 2 3
e 4 5 6
f 7 8 9
  a b c
d 1 2 3
f 7 8 9
Colon operator
1:10
10:1
Output
[1] 1 2 3 4 5 6 7 8 9 10
[1] 10 9 8 7 6 5 4 3 2 1
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
A[]
A[1:3,1:2]
A[1:3,-3]
A[,1:2]
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1] [,2]
[1,] 1 2
[2,] 4 5
[3,] 7 8
[,1] [,2]
[1,] 1 2
[2,] 4 5
[3,] 7 8
[,1] [,2]
[1,] 1 2
[2,] 4 5
[3,] 7 8
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
A[]
A[c(1,3),1:2]
A[c(1,3),c(1,2)]
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1] [,2]
[1,] 1 2
[2,] 7 8
[,1] [,2]
[1,] 1 2
[2,] 7 8
Matrix Concatenation
Matrix concatenation means attaching a row or a column to a matrix
Concatenation of a row to a matrix is done using rbind()
Concatenation of a column to a matrix is done using cbind()
Consistency of the dimensions between the matrix and the vector should be
checked before concatenation
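One way to make the dimension check explicit before concatenating is sketched below. The vector v and the result names R and K are hypothetical, chosen only for illustration.

```r
A <- matrix(1:9, 3, 3, byrow = TRUE)
v <- c(10, 11, 12)   # hypothetical vector to attach

# rbind() needs one value per column of A
if (length(v) == ncol(A)) {
  R <- rbind(A, v)
}

# cbind() needs one value per row of A
if (length(v) == nrow(A)) {
  K <- cbind(A, v)
}

dim(R)
dim(K)
```

Because A is square here, the same vector passes both checks; for a non-square matrix only one of the two would pass.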
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
B=matrix(c(10,11,12),1,3,byrow=T)
C=rbind(A,B)
A
B
C
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1] [,2] [,3]
[1,] 10 11 12
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
B=matrix(c(10,11,12),3,1,byrow=T)
C=cbind(A,B)
A
B
C
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1]
[1,] 10
[2,] 11
[3,] 12
[,1] [,2] [,3] [,4]
[1,] 1 2 3 10
[2,] 4 5 6 11
[3,] 7 8 9 12
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
B=matrix(c(10,11,12),1,3,byrow=T)
C=cbind(A,B)
A
B
C
Output
Error in cbind(A, B) : number of rows of matrices must match (see arg 2)
Execution halted
To resolve this dimension inconsistency, transpose B so that it becomes 3 by 1.
A is 3 by 3, so the number of rows now matches and the cbind operation works:
C=cbind(A,t(B)).
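A minimal sketch of this fix, reusing the A and B from the failing example and transposing B with t():

```r
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 3, 3, byrow = TRUE)
B <- matrix(c(10, 11, 12), 1, 3, byrow = TRUE)   # 1 x 3

# t(B) is 3 x 1, so its number of rows matches A
C <- cbind(A, t(B))
C
```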
Deleting a Column
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
A=A[,-2]
A
Output
[,1] [,2]
[1,] 1 3
[2,] 4 6
[3,] 7 9
To delete a column, put a minus sign before the index of the column to be deleted
and assign the result back to A.
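The same idea extends to several columns at once. The sketch below is an illustration rather than part of the original example: it removes columns 1 and 3 together and uses the drop = FALSE argument so that the single remaining column is still a matrix.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 3, 3, byrow = TRUE)

# Remove columns 1 and 3; drop = FALSE keeps the result as a 3 x 1 matrix
A2 <- A[, -c(1, 3), drop = FALSE]
A2
```

Without drop = FALSE, R would simplify the single remaining column to a plain vector.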
Deleting a Row
A=matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=T)
A=A[-2,]
A
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 7 8 9
Matrix Algebra
Matrix operations in R
Addition
Subtraction
Multiplication
Matrix Division
Addition
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
B=matrix(c(3,1,3,4,2,1,5,1,2),3,3,byrow=T)
A
B
A+B
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 8 9 1
[,1] [,2] [,3]
[1,] 3 1 3
[2,] 4 2 1
[3,] 5 1 2
[,1] [,2] [,3]
[1,] 4 3 6
[2,] 8 7 7
[3,] 13 10 3
Subtraction
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
B=matrix(c(3,1,3,4,2,1,5,1,2),3,3,byrow=T)
A
B
A-B
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 8 9 1
[,1] [,2] [,3]
[1,] 3 1 3
[2,] 4 2 1
[3,] 5 1 2
[,1] [,2] [,3]
[1,] -2 1 0
[2,] 0 3 5
[3,] 3 8 -1
Multiplication (Regular Matrix Multiplication)
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
B=matrix(c(3,1,3,4,2,1,5,1,2),3,3,byrow=T)
A
B
A%*%B
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 8 9 1
[,1] [,2] [,3]
[1,] 3 1 3
[2,] 4 2 1
[3,] 5 1 2
[,1] [,2] [,3]
[1,] 26 8 11
[2,] 62 20 29
[3,] 65 27 35
Multiplication (Element-wise Multiplication)
A=matrix(c(1,2,3,4,5,6,8,9,1),3,3,byrow=T)
B=matrix(c(3,1,3,4,2,1,5,1,2),3,3,byrow=T)
A
B
A*B
Output
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 8 9 1
[,1] [,2] [,3]
[1,] 3 1 3
[2,] 4 2 1
[3,] 5 1 2
[,1] [,2] [,3]
[1,] 3 2 9
[2,] 16 10 6
[3,] 40 9 2
Matrix Division
A=matrix(c(4,9,16,25),2,2,byrow=T)
B=matrix(c(2,3,4,5),2,2,byrow=T)
A
B
A/B
Output
[,1] [,2]
[1,] 4 9
[2,] 16 25
[,1] [,2]
[1,] 2 3
[2,] 4 5
[,1] [,2]
[1,] 2 3
[2,] 4 5
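Note that A/B divides element by element, as the output above shows. As a sketch of the distinction (not part of the original notes), the matrix-algebra analogue of division multiplies by an inverse, which base R computes with solve(). The names elementwise and right_div are only illustrative.

```r
A <- matrix(c(4, 9, 16, 25), 2, 2, byrow = TRUE)
B <- matrix(c(2, 3, 4, 5), 2, 2, byrow = TRUE)

elementwise <- A / B            # each A[i, j] divided by B[i, j]
right_div   <- A %*% solve(B)   # A multiplied by the inverse of B

elementwise
right_div
```

Multiplying right_div by B recovers A up to rounding, which is what distinguishes it from the element-wise result.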
16.11.2022
Advanced programming in R: Functions
F = function (arguments) {
Statements
}
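A minimal instance of this template is sketched below. The function name circle_area and its argument are hypothetical, chosen only to show where the arguments and statements go.

```r
# Hypothetical example: area of a circle from its radius
circle_area <- function(radius) {
  area <- pi * radius^2   # statement inside the function body
  return(area)            # value handed back to the caller
}

circle_area(2)
```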
After a function is saved in a script file, it must be loaded before it can be
invoked in R. A function can be loaded with the Source button of the R script editor.
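The same loading step can be done from code with the base R function source(). The sketch below writes a one-line function to a temporary file, which stands in for a saved script, and then loads it. The file and the function square are hypothetical.

```r
# Write a small function definition to a temporary file (stand-in for a saved script)
script_path <- tempfile(fileext = ".R")
writeLines("square <- function(x) x^2", script_path)

# source() runs the file, which defines square() in the current session
source(script_path)
square(4)
```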
Passing arguments to the functions
Passing variables as arguments to functions
Passed in the same order as in function definition
Names of the arguments can be used to pass their values in any order
Default values are used if some or all arguments are not passed
Since an R function returns only one object, first create an object called result,
which is a list holding the volume and the surface area. Calculate both values,
store them in the list, and have the function return that single object.
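The points above can be sketched with a cylinder function. The function name cylinder_stats, the default values and the use of the lateral surface area are assumptions made for illustration; the original function is not shown in these notes.

```r
# Sketch: volume and lateral surface area of a cylinder, returned as one list.
# The defaults (dia = 5, len = 100) are assumed values.
cylinder_stats <- function(dia = 5, len = 100) {
  volume <- pi * dia^2 * len / 4
  surface_area <- pi * dia * len   # lateral surface only
  result <- list(volume = volume, surface_area = surface_area)
  return(result)
}

r1 <- cylinder_stats(2, 4)              # positional: dia = 2, len = 4
r2 <- cylinder_stats(len = 4, dia = 2)  # named arguments, any order
r3 <- cylinder_stats()                  # no arguments, defaults are used

r1$volume
r1$surface_area
```

Individual results are read from the returned list with the $ operator.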
Inline functions
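Assuming that "inline function" here means a short function written on a single line without braces, a minimal sketch with a hypothetical polynomial is:

```r
# One-line function: no braces are needed for a single expression
func <- function(x) x^2 + 4 * x + 4

func(2)
func(-2)
```

This form suits simple computations; longer bodies are easier to read with braces.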
Looping Functions
apply function – Applies a given function over the margins of a given array
Syntax – apply(array, margins, function, ...)
Here margins refers to the dimensions of the array along which the function needs to
be applied (for a matrix, 1 means rows and 2 means columns).
a = matrix(1:9,3,3)
a
apply(a,1,sum)
apply(a,2,sum)
Output
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[1] 12 15 18
[1] 6 15 24
lapply function – is used to apply a function over a list
lapply always returns a list of the same length as the input list
Syntax: lapply(list, function,...)
a=matrix(1:9,3,3)
b=matrix(10:18,3,3)
Mylist=list(a,b)
a
b
determinant=lapply(Mylist,det)
determinant
Output
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
[[1]]
[1] 0
[[2]]
[1] 5.329071e-15
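The second determinant above differs from zero only because of floating-point rounding. A sketch of two possible follow-ups: sapply, a base R relative of lapply that simplifies the result to a vector, and a tolerance check. The tolerance 1e-8 is an assumed value.

```r
a <- matrix(1:9, 3, 3)
b <- matrix(10:18, 3, 3)
Mylist <- list(a, b)

dets <- sapply(Mylist, det)      # numeric vector instead of a list
near_zero <- abs(dets) < 1e-8    # treat tiny values as zero (assumed tolerance)

dets
near_zero
```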
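mapply function – applies a function to the corresponding elements of several vectors
# volcylinder is not defined in these notes; this definition is an assumption consistent with the output below
volcylinder=function(dia,len) pi*dia^2*len/4
dia=c(1,2,3,4)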
len=c(7,4,3,2)
vol=mapply(volcylinder,dia,len)
vol
Output
[1] 5.497787 12.566371 21.205750 25.132741
tapply – tapply is used to apply a function over subsets of a vector defined by a
combination of factors
Syntax – tapply(vector, factors, function, ...)
id=c(1,1,1,1,2,2,2,3,3)
values=c(1,2,3,4,5,6,7,8,9)
tapply(values,id,sum)
Output
1 2 3
10 18 17
18.11.2022
Control structures
If-else-if family
for loop
nested for loops
for loop with if break
while
#if-elseif-else example
x=6
if(x>7){
  x=x+1
}else if(x>8){
  x=x+2
}else{
  x=x+3
}
x
Output
[1] 9
seq(from=1,to=10,by=3)
Output
[1] 1 4 7 10
seq(from=1,to=10,length=4)
Output
[1] 1 4 7 10
seq(from=1,to=10,by=4)
Output
[1] 1 5 9
n=5
sum=0
for(i in seq(1,n,1)){
  sum=sum+i
  print(c(sum,i))
}
Output
[1] 1 1
[1] 3 2
[1] 6 3
[1] 10 4
[1] 15 5
Output
[1] 1 1
[1] 3 2
[1] 6 3
[1] 10 4
[1] 15 5
[1] 21 6
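The code that produced the output directly above is not shown in these notes. A sketch of a for loop with if and break that would print exactly these six lines is given below; the upper limit n = 100 and the threshold 15 are assumed values, and the variable is named total rather than sum to avoid masking the built-in function.

```r
n <- 100          # assumed upper limit of the loop
total <- 0
for (i in seq(1, n, 1)) {
  total <- total + i
  print(c(total, i))
  if (total > 15) {   # assumed stopping threshold
    break             # leave the loop early
  }
}
```

The loop stops at i = 6, the first iteration where the running total exceeds 15.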
Output
[1] 1 1
[1] 2 3
[1] 3 6
[1] 4 10
[1] 5 15
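The code for the output directly above is also missing. A while loop that would print these five lines is sketched below; the stopping value 15 and the variable names are assumptions.

```r
total <- 0
i <- 0
limit <- 15       # assumed stopping value
while (total < limit) {
  i <- i + 1
  total <- total + i
  print(c(i, total))
}
```

Unlike the for loop, the while loop has no fixed number of iterations; it repeats until the condition total < limit becomes false.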
Editor Breakpoints
Editor breakpoints can be added in RStudio by clicking to the left of the line
number in the editor, or by pressing Shift+F9 with the cursor on that line.
A breakpoint behaves like browser() but does not require changing the code.
Breakpoints are shown as a red circle on the left side, indicating that debug
mode will be entered at this line after the file is sourced.
# Calling function
function_2("s")
# Call traceback()
traceback()
The traceback() function displays the call stack that led to the most recent
error. The stack is read from the function that was called first (at the bottom)
to the function in which the error occurred (at the top). traceback can also be
set as an error handler with options(error = traceback), which prints the call
stack as soon as an error occurs, without a separate call to traceback().
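The notes call function_2("s") without showing its definition. The sketch below supplies hypothetical definitions in which function_2 calls function_1 and the error is raised one level down. It wraps the call in try() so that the script keeps running.

```r
# Hypothetical helper functions (not from the notes)
function_1 <- function(a) {
  a + 5            # fails when a is not numeric
}
function_2 <- function(b) {
  function_1(b)    # the error is raised one level down
}

# try() catches the error so the result can be inspected
res <- try(function_2("s"), silent = TRUE)
failed <- inherits(res, "try-error")
print(failed)
```

In an interactive session, calling function_2("s") directly and then traceback() would list function_1(b) above function_2("s").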
The Browse[1]> prompt in the console confirms that you are in debug mode. Some
commands to use there:
ls(): Lists the objects available in the current environment.
print(): Prints the value of an object.
n: Executes the next statement.
s: Executes the next statement, stepping into function calls.
where: Prints a stack trace.
c: Leaves the debugger and continues execution.
Q: Exits the debugger and returns to the R prompt.
Example:
Simulation Basics
https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/simulation-basics.html
Normal Distribution
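x=rnorm(1000)  # assumed sample: 1000 standard normal draws (x is not defined in these notes)
hist(x)        # histogram of the simulated values
summary(x)     # minimum, quartiles, mean and maximum
sd(x)          # sample standard deviation
library(psych) # requires the psych package to be installed
describe(x)    # descriptive statistics from psych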
If you are more interested in evaluating Tails, you could define the random
variable as:
You can also use a function called rbernoulli, which is part of the purrr package
(deprecated since purrr 1.0.0).
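A minimal sketch of a coin-flip simulation using base R's rbinom, so that no extra package is needed. Here 1 is taken to mean Heads, and Tails is obtained by reversing the coding; the seed and the number of flips are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, for reproducibility

# 10 fair coin flips: 1 = Heads, 0 = Tails
heads <- rbinom(n = 10, size = 1, prob = 0.5)

# To evaluate Tails instead, define the variable as 1 - heads
tails <- 1 - heads

heads
tails
```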
The uniform distribution can be used to simulate test scores of students, where
any score between 0 and 100 is possible.
The probability density of any score is 1/(100 - 0). The two endpoints (0 and 100)
are written out in the fraction to illustrate how you would adjust for other
bounds. These bounds are the parameters of the uniform distribution.
To simulate one exam score we would use the following code:
runif(n = 1, min = 0, max = 100)
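Extending the same idea to a whole class, the sketch below simulates 30 scores and checks that they stay within the bounds. The class size and the seed are assumptions.

```r
set.seed(42)   # arbitrary seed, for reproducibility

# 30 simulated exam scores, uniform between 0 and 100
scores <- runif(n = 30, min = 0, max = 100)

summary(scores)
in_bounds <- all(scores >= 0 & scores <= 100)
in_bounds
```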
*****