R Programming
RECOMMENDED BOOKS
• Beginning Data Science in R: Data Analysis, Visualization, and Modelling for the Data Scientist,
Thomas Mailund.
ATTRIBUTES OF R-LANGUAGE
• It is a well-known language for data science.
• R has become one of the most popular programming languages for data science across
the globe.
• It has a large, active community that maintains more than 9,000 contributed packages.
• It ships with various built-in data packages and plenty of sample data.
AGENDA POINTS
1 Installing & Downloading
2 Basics in R
Figure 4: Different Data Types in R-Programming
Data Types in R
1. Numeric (1.2, 5, 7, 3.1415)
2. Integer (1L, 2L, 3L, 4L, 5L)
3. Complex (3 - 4i)
4. Logical (TRUE/FALSE)
5. Character ("a", "apple")
6. Factor
class() tells us that we are working with numeric values.
typeof() tells us that we are working with doubles (i.e., numbers with decimals).
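As a quick check of the two functions, the sketch below applies class() and typeof() to a few literal values. The variable names x, n, and s are illustrative only and are not taken from the notes.

```r
x <- 3.1415
class(x)     # "numeric"
typeof(x)    # "double"

n <- 5L      # the L suffix makes an integer
class(n)     # "integer"
typeof(n)    # "integer"

s <- "apple"
class(s)     # "character"
```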
A local variable is declared inside a function and can only be accessed within that function, while a
global variable is declared outside any function and can be accessed from anywhere in the program,
including within other functions.
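A minimal sketch of the distinction, assuming a fresh R session; the names total, local_value, and add_one are hypothetical.

```r
total <- 10                # global variable: defined outside any function

add_one <- function() {
  local_value <- 1         # local variable: exists only inside add_one()
  total + local_value      # the global variable can be read here
}

result <- add_one()
result                     # 11
exists("local_value")      # FALSE: the local variable is gone after the call
```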
To grasp the idea, we should first know about the following R programming topics:
Valid identifiers in R are total, Sum, .fine.with.dot, this_is_acceptable, Number5
and invalid identifiers in R are tot@l, 5um, _fine, TRUE, .0ne
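One way to test such names, sketched here as an assumption rather than as the method used in the notes, is to compare each name with what base R's make.names() would turn it into: a syntactically valid, non-reserved name comes back unchanged. The helper is_valid_name is hypothetical.

```r
# A name is treated as valid when make.names() leaves it untouched
is_valid_name <- function(x) make.names(x) == x

valid   <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable", "Number5")
invalid <- c("tot@l", "5um", "_fine", "TRUE", ".0ne")

is_valid_name(valid)     # all TRUE
is_valid_name(invalid)   # all FALSE
```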
• Constants in R: Constants, as the name suggests, are entities whose value cannot be altered.
The basic types of constant are numeric constants and character constants.
• Numeric Constants: All numbers fall under this category. They can be of type integer, double
(for double-precision floating-point numbers) or complex. The type can be checked with the typeof()
function.
Numeric constants followed by L are regarded as integer and those followed by i are regarded
as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are read as hexadecimal numbers.
> 0XA
[1] 10
> 0xA
[1] 10
> 0x11
[1] 17
> 0x111
[1] ?
> 0xff
[1] 255
> 0XF + 1
[1] 16
> 0XFFF
[1] ?
> 0XAA
[1] ?
> 0XFA
[1] ?
> 0XFB
[1] ?
• Character Constants: Character constants can be represented using either single quotes (')
or double quotes (") as delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"
• Built-in Constants: Some of the built-in constants defined in R, along with their values, are
shown below.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
[1] "January" "February" "March" "April"
[5] "May" "June" "July" "August"
[9] "September" "October" "November" "December"
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
[10] "Oct" "Nov" "Dec"
2.2 R Operators
• The operators <- and = can be used, almost interchangeably, to assign a value to a variable in the
same environment. For example:
1.
x1 = 5
x2 = c(15, 16, 17)
x3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
print(x1)
print(x2)
cat("\n")
print(x3)
2.
x1 <- 5
x2 <- c(15, 16, 17)
x3 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
• <<- has the same functionality as <- but acts as a global assignment operator.
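The difference between the two operators can be sketched as follows, assuming the functions are defined at the top level of a session (where the enclosing environment that <<- searches is the global one). The names counter, bump_local, and bump_global are hypothetical.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1    # creates a new local copy; the global is untouched
  invisible(NULL)
}

bump_global <- function() {
  counter <<- counter + 1   # modifies the variable defined outside the function
  invisible(NULL)
}

bump_local()
after_local <- counter      # still 0

bump_global()
after_global <- counter     # now 1
```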
Figure 7: Detail of R-Arithmetic operator.
• Examples are
> x <- 5
> y <- 16
> x + y
[1] 21
> x - y
[1] -11
> x * y
[1] 80
> y / x
[1] 3.2
> y %/% x
[1] 3
> y %% x
[1] 1
> y ^ x
[1] 1048576
• Examples are
> x <- 5
> y <- 16
> x < y
[1] TRUE
> x > y
[1] FALSE
> x <= 5
[1] TRUE
> y >= 20
[1] FALSE
> y == 16
[1] TRUE
> x != 5
[1] FALSE
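• The operators & and | perform element-wise operations, producing a result that has the length of the longer
operand.
• The operators && and || work on single values and return a logical vector of length one. Older versions of R
examined only the first element of each operand; since R 4.3.0, an operand longer than one element is an error.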
> x&y
[1] FALSE FALSE FALSE TRUE
> x&&y
[1] FALSE
> x|y
[1] TRUE TRUE FALSE TRUE
> x||y
[1] TRUE
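The definitions of x and y do not appear in these notes. The sketch below uses assumed operands, chosen only because they reproduce the outputs shown, and indexes the first elements explicitly for && and || so that it also runs on R 4.3.0 and later.

```r
# Assumed operands (not from the notes), chosen to match the outputs above
x <- c(TRUE, FALSE, FALSE, TRUE)
y <- c(FALSE, TRUE, FALSE, TRUE)

x & y           # element-wise AND: FALSE FALSE FALSE TRUE
x | y           # element-wise OR:  TRUE TRUE FALSE TRUE

x[1] && y[1]    # single-value AND: FALSE
x[1] || y[1]    # single-value OR:  TRUE
```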
R Program to Take Input From User
Here we will learn to take input from a user using the readline() function.
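A minimal sketch of reading input, assuming an interactive session (readline() returns an empty string when R runs non-interactively, so the prompts are guarded). readline() always returns text, so numbers must be converted; the helper to_age and the prompts are hypothetical.

```r
# Hypothetical helper: convert the typed text to an integer
to_age <- function(txt) as.integer(txt)

if (interactive()) {
  name <- readline(prompt = "Enter name: ")
  age  <- to_age(readline(prompt = "Enter age: "))
  cat("Hi,", name, "next year you will be", age + 1, "years old.\n")
}
```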
• Q2. Create a sequence of the first 50 natural numbers in R. Then take the two numbers
10 and 50 and create a sequence of equally spaced values between them using
pretty() with n = 10. Then find the sum of the numbers,
then the mean, median, quartiles, range, and inter-quartile range of the numbers of this
sequence. Also compute its five-number summary and draw its box plot.
The pretty() function in R is used to compute a sequence of equally spaced round
values.
Syntax: pretty(x, n)
Parameters:
x: a numeric vector whose range the sequence should cover
n: the desired number of intervals (the result may contain slightly more or fewer)
Returns: a vector of equally spaced round values covering the range of x
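One possible way to approach Q2 is sketched below, reading "10 points" as n = 10 (an assumption). The names nat and s are illustrative.

```r
nat <- 1:50                       # first 50 natural numbers

# Equally spaced round values covering the range 10 to 50
s <- pretty(c(10, 50), n = 10)
s

sum(s)
mean(s)
median(s)
quantile(s)     # quartiles
range(s)
IQR(s)          # inter-quartile range
fivenum(s)      # five-number summary
```

Calling boxplot(s) afterwards draws the box plot of the same sequence.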
• Q: Create a vector of the first 500 values, chosen either randomly or in some definite
order.
• Write a function in R that computes the cube of (a) a number and (b) the above vector.
v = c(1:500)
cube <- function(x) x^3
cube(3)
cube(v)
• Q3. Also find the Geometric mean and Harmonic mean in the above problem
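Base R has no built-in functions for these two means, so a minimal sketch defines them from their formulas. The helper names are hypothetical, and the vector v of 1 to 500 is assumed to be the data in question.

```r
v <- 1:500

# Geometric mean: exp of the mean of logs (requires positive values)
geometric_mean <- function(x) exp(mean(log(x)))

# Harmonic mean: reciprocal of the mean of reciprocals (requires non-zero values)
harmonic_mean <- function(x) 1 / mean(1 / x)

geometric_mean(v)
harmonic_mean(v)
mean(v)   # arithmetic mean, for comparison
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, which gives a quick sanity check on the results.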
• The class() function in R helps us to understand the kind of object we have; for example, the output of
class() for a data frame is "data.frame".
• The typeof() of the same object is "list", because data frames are stored as lists in memory even though
they are presented as data frames.
3 DATA STRUCTURES
> vec2 <- c(1, 2, 3, 45, 100)
[1].....
Its class is numeric.
> vec3 <- c("abc", "def", "khan")
[1].....
Its class will be character.
> vec3 <- c(0, 1, 1, 4, 1)
[1].....
Its type will be ?
What happens if you use
> vec4 <- c(0L, 1L, 1L, 4L, 1L)
Its type and class will both be integer, not numeric.
single command.
Q6: Create a new vector from the 20th position to the 30th position of the vector
from Q4 and then concatenate it with the vector from Q1.
Q6: Assume that 20 customers enter a store per minute. Can we generate a simulation
of the number of customers per minute for the next 15 minutes?
Explanation: The quantity being modelled is
• the number of times an event occurs in a period (the observation).
Generally, we use R's rpois() function to generate random values from the Poisson
distribution and return the results. The function takes two arguments:
• the number of observations you want to see;
• the estimated rate of events for the distribution, expressed as average events
per period.
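The simulation described above can be sketched as follows. The seed value is an arbitrary assumption, added only so that the run is reproducible.

```r
set.seed(42)                              # arbitrary seed, for reproducibility

# 15 observations (minutes), average rate of 20 customers per minute
customers <- rpois(n = 15, lambda = 20)
customers

mean(customers)                           # should be close to 20
```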
distribution probability is given below:
Transpose of a vector
If a is a one-dimensional vector, then its transpose will be represented by b=t(a).
Q6. What are the dimensions of vectors a and b?
Q7. What will be a * b or b * a?
Q8. Is * commutative?
Q9. What will be the matrix multiplication for vectors a and b?
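To explore these questions, one can experiment with a small vector. The sketch below uses an assumed three-element vector and shows how to inspect dimensions and compare the element-wise operator * with the matrix product %*%.

```r
a <- c(1, 2, 3)      # assumed example vector
b <- t(a)            # transpose: a 1 x 3 matrix

dim(a)               # NULL: a plain vector has no dim attribute
dim(b)               # 1 3

a * b                # element-wise product
b * a                # same values: element-wise * is commutative
identical(a * b, b * a)

a %*% b              # matrix product: 3 x 3
b %*% a              # matrix product: 1 x 1
```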
• All attributes of an object can be checked with the attributes() function (the dimension
can be checked directly with the dim() function).
• One can check whether a variable is a matrix with the class() function.
• A matrix can be created using the matrix() function. Its syntax is A <-
matrix(c(1,2,3,4,5,6,7,8,9)).
• The dimensions of the matrix can be defined by passing appropriate values for the arguments
nrow and ncol.
For example:
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3)
• Providing values for both dimensions is not necessary. If one of the dimensions is
provided, the other is inferred from the length of the data. For example:
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3)
• By default R fills the elements column-wise; if you wish to enter the
entries row-wise, use the byrow argument.
For example:
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3, byrow=TRUE)
• To compute the transpose, one can use the command
> t(A)
[1] ??
• Let's create a matrix of order 2x4. Then create a vector of two elements and show
its multiplication with the matrix.
• Diagonal matrix: diag(k, m, n)
k gives the constant, or the vector of elements, to place on the
diagonal; m and n are the numbers of rows and columns.
For example, diag(a, 3, 3) or diag(c(a, b, c), 3, 3).
• How do we access the entries of a matrix A, or the values from its rows or columns?
> A[,m:n] picks the entries from all the rows, from the mth column to the nth
column.
> A[m:n,] picks the entries from all the columns, from the mth row to the nth
row.
> A[2:3,] ?
> A[1,] ?
> A[m,n] gives the entry that lies in the mth row and nth column.
• > A[,-m] drops the mth column from all the rows of matrix A.
• > A[-m,] drops the mth row from all the columns of matrix A.
For example, A[-3,] means delete the third row from all the columns.
Q: Make a sub-matrix in three different ways from a matrix of order 4x5, containing
the elements from row 1 to row 3 and from column 2 to column 5.
Q: How do we access the entries from row 1 and row 3, and then from row 1 to row 3?
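The indexing rules above can be tried on a small example. The 4x5 matrix below is an assumed illustration filled column-wise with 1 to 20. Negative indices return a copy without the named row or column; A itself is not changed.

```r
A <- matrix(1:20, nrow = 4, ncol = 5)   # assumed example, filled column-wise

A[, 2:3]     # all rows, columns 2 to 3
A[2:3, ]     # rows 2 to 3, all columns
A[1, ]       # first row
A[4, 5]      # entry in row 4, column 5 (here 20)
A[, -2]      # everything except column 2
A[-3, ]      # everything except row 3

diag(c(1, 2, 3), 3, 3)   # 3 x 3 diagonal matrix with 1, 2, 3 on the diagonal
```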
Matrix Concatenation
It means merging a row or column into a matrix. Its syntax is
• > rbind(M,M1)
• > cbind(M,M1)
The second command raises an error saying that the number of rows should match.
By default, M1 is a row matrix, so we need to take its transpose to make it a
column matrix, M2=t(M1), and then use
> cbind(M,M2)
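M and M1 are not defined in these notes. The sketch below assumes a 3x3 matrix M and a 1x3 row matrix M1, which is one setup that produces the behaviour described, and captures the error message with tryCatch() so that the script keeps running.

```r
M  <- matrix(1:9, nrow = 3)               # assumed 3 x 3 matrix
M1 <- matrix(c(10, 11, 12), nrow = 1)     # assumed 1 x 3 row matrix

rbind(M, M1)                              # works: a 4 x 3 matrix

# cbind() fails because M has 3 rows and M1 has only 1
res <- tryCatch(cbind(M, M1), error = function(e) conditionMessage(e))
res                                       # the error message as text

M2 <- t(M1)                               # 3 x 1 column matrix
cbind(M, M2)                              # works: a 3 x 4 matrix
```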
• Elements of different types can be stored in a list, and their individual types stay intact.
• For example
> li = list(1, "a", TRUE)
The three elements are stored as three separate components.
> typeof(li[[1]])
[1] "double"
> typeof(li[[2]])
[1] "character"
> typeof(li[[3]])
[1] "logical"
• Let's try to store three vectors by using the list command.
> p = list(c(1, 2, 3), c("a", "b", "c"), c(T, F, T))
Now, how do we extract the elements of that list?
How do we extract "a" from the above list?
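One way to answer this, sketched with the same list p: double brackets return the component itself, while single brackets return a sub-list.

```r
p <- list(c(1, 2, 3), c("a", "b", "c"), c(TRUE, FALSE, TRUE))

p[[2]]       # the second component: a character vector
p[[2]][1]    # its first element: "a"
p[2]         # single brackets: still a list, of length one
```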
3.5 Array
How to create an array of vectors?
• To extract the values of matrices from it. For example
> a = array(c(vec1, vec2), dim = c(2, 3, 2))
Q1. Write an R program to convert a given matrix to a 1 dimensional array.
Solution:
v=1:12
m=matrix(v,3,4)
print(m)
a = as.vector(m)
print(a)
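Q2. Write an R program to create an array of two 3x3 matrices, each with 3 rows and 3 columns, from two
given vectors.
Solution:
v1 = c(1,3,4,5)
v2 = c(10,11,12,13,14,15)
print(v1)
print(v2)
# Combine both vectors; the 10 values are recycled to fill 3 x 3 x 2 = 18 cells
a1 = array(c(v1, v2), dim = c(3, 3, 2))
print("New array:")
print(a1)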
Q3. Write an R program to create a 3 dimensional array of 24 elements using the dim() function.
Solution:
v = 1:24           # assumed values; any vector of 24 elements works
dim(v) = c(3,2,4)  # Set the dimensions of v to 3 x 2 x 4, i.e. four matrices of
                   # order 3 x 2.
print(v)
Q4. Write an R program to create an array of two 3x3 matrices, each with 3 rows and 3 columns, from two
given vectors.
Print the array. Then print the second row of the second matrix of the array and the element in the 3rd
row and 3rd column of the first matrix.
Solution:
v1 = c(1:5)
v2 = c(12,111,10)
a = array(c(v1,v2), dim=c(3,3,2))
print(a)
a[2,,2]   # 1st index is the row position, 2nd the column, and 3rd selects the
          # matrix.
a[3,3,1]  # The entry in the 3rd row and 3rd column of the 1st matrix.
Q5. Use any 30 natural numbers to write an R program to create an array of four given columns, three
given rows, and two given tables and display the content of the array.
Solution:
v = c(1:30)
a = array(v, dim=c(3,4,2))  # 3 rows, 4 columns, 2 tables; only the first 24 values are used
print(a)
or
a = array(1:30, dim=c(3,4,2))
Q6. Use the syntax "seq(from, to, by, length.out, along.with)" to generate a sequence of even numbers
greater than 50.
Write an R program to create a two-dimensional 5×3 array of a sequence of even integers greater than 50.
Solution:
a = seq(from=52, by=2, length.out=15)  # first 15 even integers greater than 50
array(a, dim=c(5,3))
or
a
3.6 Factor
The factor is the next data structure. It is a very important tool for machine learning
(ML) models, which typically need numerical data instead of characters.
Factors are used to represent categorical data. Factors are the data objects which are
used to categorize the data and store it as levels. They can be built from both strings and
integers. They are useful in columns that have a limited number of unique values, like
"Male", "Female" and TRUE, FALSE, etc. They are useful in data analysis for statistical
modeling.
Factors can be ordered or unordered and represent an important class for statistical analysis
and also for plotting. Factors are stored as integers, and have labels associated with
these unique integers. While factors look (and often behave) like character vectors,
they are actually integers under the hood, and you need to be careful when treating
them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default,
R always sorts levels in alphabetical order.
Factors are created using the factor() function by taking a vector as input.
• data = c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East",
"North")
print(data)
print(is.factor(data))
Note: The is.factor() function in R is used to check whether the object passed to
the function is a factor. It returns a logical value as output.
Let's try this
data = c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East",
"North")
factor_data = factor(data)
print(factor_data)
Applying the factor() function again with the levels argument sets a new order of the levels.
new_order_data = factor(factor_data, levels = c("East", "West", "North"))
print(new_order_data)
So far we have a factor with 3 levels. Let's understand factors with the help of some more
examples.
• Suppose that you have a variable that records the month:
x1 = c("Dec", "Apr", "Jan", "Mar")
• Storing such values as strings has two problems: typos go unnoticed, and sorting is alphabetical
rather than chronological. Both can be fixed with a factor. To create a factor, you must start
by creating a vector of the valid levels:
M_levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
• Now you can create a factor:
y1 = factor(x1, levels = M_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
• Any values not in the set will be silently converted to NA. For
example:
y2 = factor(x2, levels = M_levels)
y2
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
• If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar
• Sometimes one would prefer that the order of the levels match the order
of the first appearance in the data. You can do that when creating the factor by
setting levels to unique(x1):
f1 = factor(x1, levels = unique(x1))
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
• If you ever need to access the set of valid levels directly, you can do so with levels():
levels(f1)
[1] "Dec" "Apr" "Jan" "Mar"
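The month examples can be run end to end as sketched below. The vector x2 is never defined in these notes, so the version here, with the deliberate typo "Jam", is an assumption chosen to reproduce the NA shown above. The built-in constant month.abb stands in for the vector of valid levels.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")
x2 <- c("Dec", "Apr", "Jam", "Mar")    # hypothetical: "Jam" is a deliberate typo

y1 <- factor(x1, levels = month.abb)   # month.abb is "Jan", "Feb", ..., "Dec"
y2 <- factor(x2, levels = month.abb)

sort(y1)          # sorted in calendar order, not alphabetically
is.na(y2)         # the misspelled month became NA
as.integer(y1)    # the integer codes stored under the hood
```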
• Another example:
The factor() command is used to create and modify factors in R:
> color <- c("red", "blue", "green")
> factor(color)
[1] red blue green
Levels: blue green red
Note that the levels are arranged alphabetically.
Note: To remove variables from the environment, one can use the
rm() command; to clear every variable from the environment, you can use
rm(list=ls()).
• Another example is
height = c(132, 151, 162, 139, 166, 147, 122)
weight = c(48, 49, 66, 53, 67, 52, 40)
gender = c("male", "male", "female", "female", "male", "female", "male")
input_data = data.frame(height, weight, gender)
print(input_data)
print(is.factor(input_data$gender))  # To test if the gender column is a factor (FALSE in R >= 4.0
                                     # unless stringsAsFactors = TRUE is passed to data.frame()).
print(input_data$gender)  # To print the gender column and, if it is a factor, see the levels.
Note: Apply the data.frame() command to reproduce the tables from Q2.20 to Q2.29 of
the recommended book of Statistics.
• Let's create a vector having numerical values: x = c(123, 54, 23, 876, NA, 134, 2346, NA)
To calculate the sum while removing the NA values from the summation, use
sum(x, na.rm = TRUE); otherwise the answer is NA.
Note: The argument na.rm gives a simple way of removing missing values from data
if they are coded as NA. In base R its default value is FALSE, meaning
NAs are not removed.
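A short sketch of the effect of na.rm on the vector above. The extra calls to mean() and is.na() are added for illustration.

```r
x <- c(123, 54, 23, 876, NA, 134, 2346, NA)

sum(x)                  # NA: missing values propagate by default
sum(x, na.rm = TRUE)    # 3556: NAs are dropped before summing
mean(x, na.rm = TRUE)   # the same argument works for mean()
sum(is.na(x))           # number of missing values: 2
```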
3.8 Inbuilt Functions in R
To view all the available datasets, use the data() function; it will display all the
datasets available with the R installation.
Let's do some work with the data frame named iris. Numerous guides have been written
on the exploration of this widely known dataset. Iris, introduced by Ronald
Fisher in his 1936 paper, The use of multiple measurements in taxonomic problems,
contains three plant species (setosa, virginica, versicolor) and four features
measured for each sample. These quantify the morphologic variation of the iris
flower in its three species, all measurements given in centimeters.
It can be loaded and viewed with the commands
– data(iris)
– View(iris)
The iris dataset is a built-in dataset in R that contains measurements on 4
different attributes (in centimeters) for 150 flowers from 3 different species, i.e.
"setosa", "versicolor", and "virginica".
– class(iris)
[1] "data.frame"
– str(iris)
This command is used to view the structure of the iris data frame.
– To get the dimensions of the dataset in terms of the number of rows and the
number of columns, we can use the dim() function.
> dim(iris)
[1] 150 5
The data set contains 150 rows and 5 columns.
– To display the column names of the data frame, we can use the names() function.
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
– head(iris)
It is used to view the first six records.
– head(iris,n)
It is used to view the first n records.
– tail(iris)
It is used to view the last six records.
– How to retrieve the columns of the data frame iris?
For that, use commands such as
> iris$Sepal.Length
– table()
The table() function in R can be used to quickly create frequency tables.
Let's use this command on all the columns.
– table(iris$Species)
It provides the frequency table, giving the frequency of each level.
– To quickly summarize each variable in the dataset, we can use the summary()
function.
– The following code shows how to use prop.table() to create a frequency table
of proportions for the Species variable in our data frame:
prop.table(table(iris$Species))
– To construct the frequency table for two variables of the iris data set (replace iris
with head(iris) to restrict it to the first six records), use the code
table(iris$Sepal.Length,iris$Sepal.Width)
– To construct the frequency table of proportions for two variables, use the
code
prop.table(table(iris$Sepal.Length,iris$Sepal.Width))
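The exploration steps above can be collected into one short script. This is a sketch using only base R and the built-in iris data; the names species_counts and first_six are illustrative.

```r
data(iris)

dim(iris)                       # 150 rows, 5 columns
names(iris)                     # column names

species_counts <- table(iris$Species)
species_counts                  # frequency of each species
prop.table(species_counts)      # the same table as proportions

first_six <- head(iris)         # first six records only
table(first_six$Sepal.Length, first_six$Sepal.Width)

summary(iris$Sepal.Length)      # quick summary of one variable
```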
Figure 13: Histogram of column Sepal length of data set IRIS
Figure 14: Scatter diagram columns Sepal Width vs Sepal Length of data set IRIS
Q: Repeat the above process for the data sets airquality, cars, iris3, quakes, etc.
To create a box plot by group, we can use the boxplot() function.
For a box plot of Sepal.Length vs. Species, use the command
boxplot(Sepal.Length ~ Species, data = iris, main = 'Sepal Length by Species', xlab = 'Species',
ylab = 'Sepal Length', col = 'steelblue', border = 'black')
Figure 15: Boxplot of Sepal Length vs Species of data set IRIS
– The x-axis displays the three species and the y-axis displays the distribution
of values for sepal length for each species.
– This type of plot allows us to quickly see that the sepal length tends to be
largest for the virginica species and smallest for the setosa species.
3.9 How to import the data
– Excel file
– CSV file: a comma-separated values file. It is a delimited text
file that uses a comma to separate values. Each line of the file is a data record.
– Tab-delimited text: a file in which tabs separate the fields, with
one record per line. A tab-delimited file is often used to upload data to a
system. The most common program used to create these files is Microsoft
Excel.
Some examples of importing files (read_excel() comes from the readxl package) are
– x = read_excel("D:/IME-MA-240+L-ProbandStat/R-Language/csvorexcelData/LungCa
– y = read_excel("D:/IME-MA-240+L-ProbandStat/R-Language/csvorexcelData/smoker
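For the CSV and tab-delimited formats, base R is enough. The sketch below is self-contained: it first writes a few rows of the built-in cars data to temporary files (hypothetical stand-ins for your own files) and then reads them back.

```r
# Hypothetical file locations: temporary files stand in for real data files
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")

write.csv(head(cars), csv_path, row.names = FALSE)
write.table(head(cars), tsv_path, sep = "\t", row.names = FALSE)

cars_csv <- read.csv(csv_path)      # comma-separated values
cars_tsv <- read.delim(tsv_path)    # tab-delimited text

str(cars_csv)
```

With a real file, replace csv_path by its location, for example a path chosen interactively with file.choose().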
> fivenum(cars$speed)
> quantile(cars$speed)
> quantile(cars$speed,0.5)
> quantile(cars$speed,0.75)
> quantile(cars$speed,0.95)
> IQR(cars$speed)
> range(cars$speed)
> min(cars$speed)
> max(cars$speed)
> var(cars$speed)
> sd(cars$speed)
> sqrt(var(cars$speed))
– > data() You will get a list of the available data sets, one of which is
cars. Students, please view it.
Linear regression is a regression model that uses a straight line to describe the
relationship between variables. It finds the line of best fit through your data by
searching for the values of the regression coefficient(s) that minimize the total error
of the model.
X   3   5   6   9  10  12  15  20  22  28
Y  10  12  15  18  20  22  27  30  32  34
First, create these two variables in the R console window or in a script file.
We can now create a simple plot of the two variables as follows:
plot(X, Y)
We can improve this plot by using various arguments within the plot() command.
Copy and paste the following code into the R workspace:
plot(X, Y, pch = 16, cex = 1.3, col = "blue", main = "Testing of experimental data",
xlab = "Loads", ylab = "Length")
Figure 17: Improved Scatter Plot
In the above code, the syntax pch = 16 creates solid dots, while cex = 1.3 creates
dots that are 1.3 times bigger than the default (where cex = 1). More about these
commands later.
Now let's perform a linear regression using lm() on the two variables by adding
the following text at the command line: lm(Y ~ X)
Figure 18: Results of Regression Line
Note that the intercept is 8.804 and the slope is 1.015. By the way, lm stands
for "linear model".
We can add a best-fit line or regression line to our plot by adding the following text at
the command line, where the two arguments of abline() are the y-intercept and the slope:
abline(8.804, 1.015)
Figure 19: Graph of Regression Line
We can also use the syntax "abline(lm(Y ~ X))" to plot the regression line.
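Putting the regression steps together, the sketch below enters the X and Y values from the table, fits the model, and draws the fitted line. Storing the fit in an object named model is a choice made here for convenience.

```r
X <- c(3, 5, 6, 9, 10, 12, 15, 20, 22, 28)
Y <- c(10, 12, 15, 18, 20, 22, 27, 30, 32, 34)

model <- lm(Y ~ X)            # fit Y as a linear function of X
round(coef(model), 3)         # intercept 8.804, slope 1.015

plot(X, Y, pch = 16, col = "blue", xlab = "Loads", ylab = "Length")
abline(model)                 # add the fitted regression line

cor(X, Y)                     # correlation between the two variables
```

Passing the fitted model to abline() avoids retyping rounded coefficients by hand.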
Now we can use several R diagnostic plots and influence statistics to diagnose how
well our model is fitting the data. These diagnostic plots include:
To use R’s regression diagnostic plots, we set up the regression model as an object
and create a plotting environment of two rows and two columns. Then we use the
plot() command, treating the model as an argument.
model = lm(Y ~ X)
par(mfrow = c(2,2))
par(bg = gray(.9))  # Optional command to create a light gray background
plot(model)
Note: par(mfrow) is useful to arrange multiple plots in the same plotting space.
par(mfrow = c(2,2)) means multiple plots with two rows and two columns. The
mfrow and mfcol parameters allow you to create a matrix of plots in one plotting
space.
In the next step, we will walk you through linear regression in R by using a sample
dataset named income.data. Some screenshots showing how to import files are given below.
Figure 21: How to import the csv file
Figure 22: How to import the csv file
Figure 23: How to import the csv file
Q. Draw the scatter plot, find the regression coefficients, and fit a line (an approximation)
to the above data to explore the relationship between the variables
income and happiness. Also, find the correlation coefficient between these variables
by using the command cor(A, B) and then interpret it. Also use the commands
model = lm(Y ~ X)
par(mfrow = c(2,2))
plot(model)
to get the complete layout:
Figure 24: Layout