r Programming

The document discusses the R programming language, highlighting its attributes, installation process, and basic programming concepts. It includes recommended books, data types, operators, and exercises for practice, focusing on data structures and user input. Additionally, it introduces the Poisson distribution and its applications in various fields.


Language/Programming of Data Science

Dr. Kashif Ali Khan

13th January, 2025

RECOMMENDED BOOKS
• Beginning Data Science in R: Data Analysis, Visualization, and Modelling for the Data Scientist, Thomas Mailund.

• The R Software: Fundamentals of Programming and Statistical Analysis, Pierre Lafaye de Micheaux, Rémy Drouilhet, Benoit Liquet.

ATTRIBUTES OF R-LANGUAGE
• It is an open-source programming language used for statistical computing.

• It is a well-known language for data science and, alongside Python, one of the most widely used languages in the field.

• It was inspired by the S programming language (also distributed as S-PLUS) and is similar to it.

• It is optimized for vector operations.

• It has an active community that maintains more than 9000 contributed packages.

Some of the other features are:

• It is free software.

• Non-coders can also do well with it.

• It can be integrated with other languages such as C, C++, Java, and Python.

• It ships with various built-in data packages and a lot of sample data.

AGENDA POINTS

Figure 1: Layout of R-Programming

1 Installing and Downloading R

Figure 2: Installation of R-Programming

• Use the link https://cran.r-project.org/

After that, we need an IDE (integrated development environment); here we use RStudio.

• For RStudio, use the link https://www.rstudio.com/products/rstudio/
2 Basics in R

Figure 3: Definition of a variable

Figure 4: Different Data Types in R-Programming

Can we compare two complex numbers in R?


To check the class of an object, use class(...), for example
> class(3.5) output = numeric
> class(3 - 4i) output = complex
> class("my name is khan") output = character

Data Types in R

Individual   Height   Sex
A            31       Male
B            15.3     Male
C            20.5     Female
D            17.2     Male
E            25       Female

1. Numeric (1.2, 5, 7, 3.1415)
2. Integer (1L, 2L, 3L, 4L, 5L)
3. Complex (3 - 4i)
4. Logical (TRUE/FALSE)
5. Character ("a", "apple")
6. Factor
For a number such as 3.1415, class() tells us that we are working with numeric values.
typeof() tells us that the value is stored as a double (i.e. a number with decimals).
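The difference between class() and typeof() is easiest to see side by side. The following is a small sketch; the sample values and the names values, classes and types are chosen here only for illustration.

```r
# One sample value for each of the six data types listed above
values <- list(3.14, 5L, 3 - 4i, TRUE, "apple", factor("a"))

classes <- sapply(values, class)    # how R presents the object
types   <- sapply(values, typeof)   # how R stores the object internally

print(data.frame(class = classes, typeof = types))
```

Note that a factor reports the class factor but is stored as an integer, and that an ordinary number reports the class numeric but is stored as a double.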
A local variable is declared inside a function and can only be accessed within that function, while a global variable is declared outside any function and can be accessed from anywhere in the program, including within functions.
To grasp these ideas, we should first know about the following R programming topics:

2.1 R Variables and Constants


• Variables are used to store data whose value can be changed as needed. The unique name given to a variable (or to a function or object) is called an identifier. Identifiers can combine letters, digits, periods (.) and underscores (_). However, they must start with a letter or a period.
Some of the rules for writing identifiers in R are
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.
For example
var_name2. <- 7 runs
var_name& <- 7 does not run
var_name <- 5 runs
4thvar <- 7 does not run
.Var_name <- 6 runs
.2var <- 3 does not run
Var_name <- 9 runs
Some of the other valid identifiers in R are
total, Sum, .fine.with.dot, this_is_acceptable, Number5
and invalid identifiers in R are tot@l, 5um, _fine, TRUE, .0ne

• Constants in R: Constants, as the name suggests, are entities whose value cannot be altered. The basic types of constant are numeric constants and character constants.

• Numeric Constants: All numbers fall under this category. They can be of type integer, double (for double-precision floating-point numbers) or complex. The type can be checked with the typeof() function.
Numeric constants followed by L are regarded as integer and those followed by i are regarded as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"

Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.

> 0XA
[1] 10

> 0xA
[1] 10

> 0x11
[1] 17

> 0x111
[1] ?

> 0xff
[1] 255
> 0XF + 1
[1] 16

> 0XFFF
[1] ?

> 0XAA
[1] ?

> 0XFA
[1] ?

> 0XFB
[1] ?

as.double() converts integers to double-precision values:

a = c(1L, 6L, 10L)
as.double(a)
[1] 1 6 10

as.numeric() is identical to as.double():

as.numeric(a)
[1] 1 6 10

as.integer() converts doubles to integers by dropping the fractional part:

b = c(1.7, 2.2, 4.9)
as.integer(b)
[1] 1 2 4

• Character Constants: Character constants can be represented using either single quotes (') or double quotes (") as delimiters.

> 'example'
[1] "example"
> typeof("5")
[1] "character"

• Built-in Constants: Some of the built-in constants defined in R, along with their values, are shown below.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
[1] "January" "February" "March" "April"
[5] "May" "June" "July" "August"
[9] "September" "October" "November" "December"
> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
[10] "Oct" "Nov" "Dec"
2.2 R Operators

Figure 5: Different Types of Operators in R-Programming

Figure 6: Detail of R-Assignment operator.

• The operators <- and = can be used, almost interchangeably, to assign to a variable in the same environment. For example

1.
x1 = 5
x2 = c(15, 16, 17)
x3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
print(x1)
print(x2)
cat("\n")
print(x3)
2.
x1 <- 5
x2 <- c(15, 16, 17)
x3 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

• <<- has the same functionality as <- but acts as a global assignment operator.

• Some other examples are

> y <- 5
> y
[1] 5
> y = 9
> y
[1] 9
> 10 -> y
> y
[1] 10
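The difference between <- and <<-, and between local and global variables, shows up most clearly inside a function. The sketch below uses the hypothetical names counter, add_local and add_global, which do not come from the lecture.

```r
counter <- 0                   # a global variable

add_local <- function() {
  counter <- counter + 1       # <- creates a local copy; the global is untouched
  counter
}

add_global <- function() {
  counter <<- counter + 1      # <<- modifies the variable in the enclosing environment
  counter
}

after_local <- {add_local(); counter}    # global value after the local update
after_global <- {add_global(); counter}  # global value after the global update
print(c(after_local, after_global))
```

The first call leaves the global counter unchanged, while the second call updates it.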

Figure 7: Detail of R-Arithmetic operator.

• Examples are
> x <- 5
> y <- 16
> x + y
[1] 21
> x - y
[1] -11
> x * y
[1] 80
> y / x
[1] 3.2
> y %/% x
[1] 3
> y %% x
[1] 1
> y^x
[1] 1048576

Figure 8: Detail of R-Relational operator.

• Examples are
> x <- 5
> y <- 16
> x < y
[1] TRUE
> x > y
[1] FALSE
> x <= 5
[1] TRUE
> y >= 20
[1] FALSE
> y == 16
[1] TRUE
> x != 5
[1] FALSE

Figure 9: Detail of R-Logical operator.

• The operators & and | perform element-wise operations, producing a result with the length of the longer operand.

• The operators && and || work on single values and return a single logical value. Older versions of R silently examined only the first element of longer operands; since R 4.3.0, passing an operand of length greater than one is an error.

• Zero is considered FALSE and non-zero numbers are taken as TRUE.

• Some of the examples are

> x <- c(TRUE, FALSE, 0, 6)
> y <- c(FALSE, TRUE, FALSE, TRUE)
> !x
[1] FALSE TRUE TRUE FALSE
> x & y
[1] FALSE FALSE FALSE TRUE
> x[1] && y[1]
[1] FALSE
> x | y
[1] TRUE TRUE FALSE TRUE
> x[1] || y[1]
[1] TRUE
R Program to Take Input From User
Here we will learn to take input from a user using readline() function.

Example: Take input from user
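A minimal sketch of such a program is given here. The helper ask() and its default answers are assumptions added so that the script also runs non-interactively (for example with Rscript), where readline() returns an empty string.

```r
# ask() is a hypothetical helper: it prompts the user in an interactive
# session and falls back to a default answer otherwise.
ask <- function(prompt, default) {
  if (interactive()) readline(prompt) else default
}

name <- ask("Enter your name: ", "Ali")
age  <- as.integer(ask("Enter your age: ", "21"))  # readline() returns text

cat("Name:", name, "\n")
cat("Age next year:", age + 1L, "\n")
cat(R.version.string, "\n")   # version of the R installation
```

Because readline() always returns a character string, numeric input has to be converted with as.integer() or as.numeric() before it is used in arithmetic.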

2.3 Practice Exercise


• Q1. Write an R program to take input from the user (his/her name, age and roll number) and display the values. Also print the version of the R installation.

• Q2. Make a sequence of the first 50 natural numbers in R. Then take the two numbers 10 and 50 and create a sequence of equally spaced values between them with pretty(), asking for 10 intervals. Then find the sum of the numbers, followed by the mean, median, quartiles, range, and inter-quartile range of this sequence. Also compute its five-number summary and draw its box plot.
The pretty() function in R is used to compute a sequence of equally spaced round values.
Syntax: pretty(x, n)
Parameters:
x: a numeric vector of data
n: the desired number of intervals (the result has roughly n + 1 values)
Returns: a vector of equally spaced round values covering the range of x
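One possible way to approach Q2 is sketched below; the variable names u and s are chosen here for illustration, and the exact values returned by pretty() depend on how R rounds the step size.

```r
u <- 1:50                        # first 50 natural numbers
s <- pretty(c(10, 50), n = 10)   # equally spaced round values covering 10..50
print(s)

print(sum(s))        # sum
print(mean(s))       # mean
print(median(s))     # median
print(quantile(s))   # quartiles
print(range(s))      # smallest and largest value
print(IQR(s))        # inter-quartile range
print(fivenum(s))    # five-number summary
# boxplot(s) draws the corresponding box plot
```

The call boxplot(s) is left as a comment so that the sketch produces no graphics file when it is run as a script.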

• Q: Make a vector of the first 500 values, either chosen randomly or in some definite order.

• Make a function in R that computes the cube of (a) a number and (b) the above vector.
v = c(1:500)
cube <- function(x) x^3
cube(3)
cube(v)

• Q3. Also find the geometric mean and harmonic mean in the above problem.

• The typeof() of a data frame is list, because data frames are stored as lists in memory even though they are presented as data frames.

• The class() function in R helps us to understand the type of an object; for example, the output of class() for a data frame is data.frame.
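Returning to Q3 above: base R has no built-in geometric or harmonic mean, so one common approach is to define them from their formulas. The function names geo_mean and har_mean below are hypothetical, and the sketch assumes strictly positive input.

```r
# Geometric mean: exponential of the mean of the logs (positive values only)
geo_mean <- function(x) exp(mean(log(x)))

# Harmonic mean: reciprocal of the mean of the reciprocals
har_mean <- function(x) 1 / mean(1 / x)

v <- 1:500
print(c(harmonic = har_mean(v), geometric = geo_mean(v), arithmetic = mean(v)))
```

For positive data the three means are always ordered as harmonic ≤ geometric ≤ arithmetic, which gives a quick sanity check on the output.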

3 DATA STRUCTURES

Figure 10: Data structures in R-Programming

3.1 Vectors in R-language


A vector is a homogeneous, one-dimensional data structure.
Its syntax is:
name <- c(element1, element2, element3)
Topics that we need to cover
(i) Types of vector
(ii) Algebra of vectors
(iii) Transpose of vectors
(iv) Extract elements from vector
> vec1 <- c(T, F, T)
> vec1
[1] .....
> vec1 + 1
[1] .....
What is its type?
Use either typeof() or class().
> vec2 <- c(1, 2, 3, 45, 100)
> vec2
[1] .....
Its class is numeric.
> vec3 <- c("abc", "def", "khan")
> vec3
[1] .....
Its class will be character.
> vec3 <- c(0, 1, 1, 4, 1)
> vec3
[1] .....
Its type will be ?
What happens if you use
> vec4 <- c(0L, 1L, 1L, 4L, 1L)
Its type and class will both be integer.

Let's create some mixed vectors.

> mixvec <- c(1, 2, T, 3, 4, F)
> mixvec
What will be its type and class?
> mixvec <- c("khan", 2, 3, "ali")
> mixvec
Its type is determined by
> typeof(mixvec)
[1] "character"
Note: Numeric has higher precedence than logical, and character has higher precedence than all the others.
Coercion follows this order:
character > double > integer > logical
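The coercion order can be checked directly with typeof(). This is a short sketch; the variable names are chosen here for illustration.

```r
t1 <- typeof(c(TRUE, 2L))               # logical + integer  -> "integer"
t2 <- typeof(c(1, 2, TRUE, 3, 4, FALSE)) # logical + double   -> "double"
t3 <- typeof(c(5L, 2.5))                # integer + double   -> "double"
t4 <- typeof(c("khan", 2, 3, "ali"))    # anything + character -> "character"
print(c(t1, t2, t3, t4))

# Logical values become 1 and 0 when they are coerced to numbers
mixvec <- c(1, 2, TRUE, 3, 4, FALSE)
print(mixvec)
```

In each case the whole vector takes the type of its highest-ranking element, so a single character element turns every number into text.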

3.2 Practice Exercise

Q1: Create a vector, say u, of the first 15 natural numbers by using a single command.
Q2: Create a vector, say u1, where each element is thrice the corresponding element of vector u by using a single command.
Q3: Create a vector, say v, that contains the multiples of 3 from the first 100 natural numbers.
Q4: Create a vector of n modulo 3 from Question 1, where n indicates the list of the first 15 naturals, by using a single command.
Q5: Combine the above four vectors by using a single command.
Q6: Create a new vector from the 20th position to the 30th position from Q4 and then concatenate it with the vector from Q1.

Q7: Assume that 20 customers enter a store per minute. Can we generate a simulation of the number of customers per minute for the next 15 minutes?
Explanation: We describe the process by

• A window of observation: a specific time period in which events can occur.

• A rate of occurrence: how often an event is expected to occur in that window.

• The number of times an event occurs (the observation).

Generally, we use R's rpois() function to generate random values from the Poisson distribution. The function takes two arguments:
n: the number of observations you want to see
lambda: the estimated rate of events for the distribution, expressed as average events per period.

• The syntax is:

rpois(n, lambda)

• The Poisson distribution is probably one of the most practical statistical distributions for answering questions in today's world. It has been used for more than a century. Its use cases cover problems from business, banking, insurance, science, medicine, and risk management, to name a few. Some basic details and concepts of the Poisson distribution follow, before we look into its applications in different domains.

• The Poisson distribution is a discrete probability distribution named in honor of the French mathematician and physicist Siméon Denis Poisson (1781–1840). It is "discrete" because it gives the probabilities of countable, distinct values.

• The Poisson distribution is commonly used to estimate the number of occurrences of a specified event during a particular period of time. The formula for the Poisson probability is given below:

$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$

where λ is the average number of events per period.

In 1898, the Russian economist and statistician Ladislaus Josephovich Bortkiewicz published an interesting finding about the probability distribution of Prussian soldiers accidentally killed by horse kick. The data were derived from ten army corps observed over 20 years. There were a total of 200 observations, and 122 soldiers were killed by horse kick over those 20 years. On average, the number of deaths per observation is λ = 122/200 = 0.61.
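Both the customer simulation and the horse-kick example can be sketched in a few lines. The seed value and the variable names below are assumptions made for illustration; the probabilities use the standard Poisson formula $P(X = k) = \lambda^{k} e^{-\lambda} / k!$.

```r
set.seed(1)   # arbitrary seed so the simulation can be repeated

# 20 customers per minute on average, simulated for 15 minutes
customers <- rpois(15, lambda = 20)
print(customers)

# Horse-kick data: lambda = 122 deaths / 200 observations
lambda <- 122 / 200
k <- 0:4
p <- dpois(k, lambda)                                # built-in Poisson probabilities
manual <- lambda^k * exp(-lambda) / factorial(k)     # the same values from the formula

# Probability of k deaths and the expected number of observations (out of 200)
print(round(data.frame(k, probability = p, expected = 200 * p), 3))
```

dpois() gives the probability of exactly k events, so multiplying by 200 gives the number of observations in which one would expect to see k deaths.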

Transpose of a vector
If a is a one-dimensional vector, then its transpose is given by b = t(a).
Q6. What are the dimensions of vectors a and b?
Q7. What will be a * b or b * a?
Q8. Is * commutative?
Q9. What will be the matrix multiplication for vectors a and b?
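These questions can be explored with a small experiment. The vector a = c(1, 2, 3) is an example chosen here, not one taken from the lecture.

```r
a <- c(1, 2, 3)
b <- t(a)

print(dim(a))   # NULL: a plain vector has no dimension attribute
print(dim(b))   # 1 3: t() turns the vector into a 1 x 3 row matrix

ab <- a * b     # element-wise product
ba <- b * a
print(identical(ab, ba))   # element-wise * gives the same result in either order

outer_prod <- a %*% b   # a is treated as a 3 x 1 column: 3 x 3 result
inner_prod <- b %*% a   # 1 x 3 times 3 x 1: 1 x 1 result
print(outer_prod)
print(inner_prod)
```

The element-wise operator * does not depend on the order of its operands here, whereas the matrix product %*% gives results of different sizes depending on the order.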

3.3 MATRIX in R-language


• A matrix is a two-dimensional data structure in R programming.

• A matrix is similar to a vector but additionally contains the dimension attribute.

• All attributes of an object can be checked with the attributes() function (the dimension can be checked directly with the dim() function).

• One can check whether a variable is a matrix with the class() function.

• How to create a matrix in R programming?

• A matrix can be created using the matrix() function. Its syntax is A <- matrix(c(1,2,3,4,5,6,7,8,9)). Without nrow or ncol, this gives a matrix with a single column.

• R is case sensitive, so the function name matrix must be written in lower case.

• The dimension of the matrix can be defined by passing appropriate values for the arguments nrow and ncol.
For example
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3)

• Providing values for both dimensions is not necessary. If one of the dimensions is provided, the other is inferred from the length of the data. For example
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3)

• By default R fills the elements column-wise; if you wish to enter the entries row-wise, you need to use the argument byrow.
For example
A <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3, byrow=TRUE)
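The effect of byrow is easiest to see by building the same data both ways. This is a sketch; the names A_col and A_row are chosen here for illustration.

```r
A_col <- matrix(1:9, nrow = 3)                 # default: filled column by column
A_row <- matrix(1:9, nrow = 3, byrow = TRUE)   # filled row by row

print(A_col)
print(A_row)

print(attributes(A_col))   # shows the dim attribute
print(dim(A_col))          # 3 3
print(class(A_col))        # includes "matrix"

# Filling by row is the same as transposing the column-wise matrix
print(identical(A_row, t(A_col)))
```

In A_col the first row reads 1 4 7, while in A_row it reads 1 2 3.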

• To compute the transpose, one can use the command
A1 <- t(A)
[1] ??

• To show two matrices successively, one can use the command
A; A1
[1] ??

• How to multiply two matrices in R programming?
A * A1 (element-wise multiplication)
[1] ??
A1 * A
[1] ??

• Q: Generate a matrix P of order 4 by 4 filled with random values and another matrix Q of order 4 by 2.
Also create a third matrix R of order 4 by 4 filled with random values, where the minimum can be 5 and the maximum can be 11.

• Then show the outputs of P*Q, Q*P, and P*R
Hint: Use runif(n), which creates n random values, or runif(n, min=a, max=b), where n indicates the total number of data values required. Use the %*% operator for actual matrix multiplication:
B %*% C
[1] ??
C %*% D
[1] ??

• Let's create a matrix of order 2x4. Then create a vector of two elements and show its multiplication with the matrix.
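One way to set up this exercise is sketched below. The seed and the names PQ, PR, PR_elem, QP, M24 and v are assumptions made for illustration.

```r
set.seed(42)   # arbitrary seed so the random matrices can be reproduced

P <- matrix(runif(16), nrow = 4)                      # 4 x 4, values in (0, 1)
Q <- matrix(runif(8), nrow = 4)                       # 4 x 2
R <- matrix(runif(16, min = 5, max = 11), nrow = 4)   # 4 x 4, values in (5, 11)

PQ <- P %*% Q        # matrix product, 4 x 2
PR <- P %*% R        # matrix product, 4 x 4
PR_elem <- P * R     # element-wise product, needs equal dimensions

# Q %*% P is not defined (4 x 2 times 4 x 4), so R reports an error
QP <- try(Q %*% P, silent = TRUE)
print(class(QP))

# A 2 x 4 matrix multiplied by a vector of two elements
M24 <- matrix(1:8, nrow = 2)
v <- c(1, 2)
print(v %*% M24)     # v is treated as a 1 x 2 row: 1 x 4 result
```

The same dimension rule explains why P * Q also fails: element-wise multiplication requires both matrices to have the same shape.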

Construction of special matrices


• Matrix where all the entries are filled by a constant k: matrix(k,m,n)

• Diagonal matrix: diag(k,m,n)
k gives the constant, or the vector of elements, to place on the diagonal.
For example diag(a,3,3) or diag(c(a,b,c),3,3)

• Identity matrix: diag(1,m,n)
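The three constructions can be tried as follows; the numeric values are arbitrary examples.

```r
const_mat <- matrix(7, 2, 3)           # 2 x 3 matrix with every entry equal to 7
diag_k    <- diag(5, 3, 3)             # 5 on the diagonal, 0 elsewhere
diag_vec  <- diag(c(1, 2, 3), 3, 3)    # 1, 2, 3 on the diagonal
identity  <- diag(1, 3, 3)             # 3 x 3 identity matrix

print(const_mat)
print(diag_k)
print(diag_vec)
print(identity)
```

For a square identity matrix, diag(3) is a shorter equivalent of diag(1, 3, 3).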

To find the size of matrix A

• dim(A): returns the size of matrix A

• nrow(A): returns the number of rows of matrix A

• ncol(A): returns the number of columns of matrix A

• prod(dim(A)) or length(A): returns the total number of elements of matrix A.

Note: To create a sequence with the help of the colon sign ":"

> 1:11 creates a vector in increasing order of the elements with an equal spacing of one unit.
> 11:1 creates a vector in decreasing order.

Accessing/Editing/Deleting the elements in Matrices


• For example, take a 3x3 matrix A and name its columns and rows:
A = matrix(c(1,2,3,4,5,6,7,8,9), nrow=3)
colnames(A) <- c("C-1","C-2","C-3")
rownames(A) <- c("R-1","R-2","R-3")

• How to access the entries of a matrix, or the values from a row or column of matrix A?
> A[,m:n] picks the entries from all the rows, from the mth column to the nth column.
> A[m:n,] picks the entries from all the columns, from the mth row to the nth row.
> A[2:3,] ?
> A[1,] ?
> A[m,n] gives the entry that lies in the mth row and nth column.

• > A[nrow(A),] accesses the last row of matrix A.

• > A[,ncol(A)] accesses the last column of matrix A.

• > A[-m,] drops the mth row from matrix A.

• > A[,-m] drops the mth column from matrix A.
For example, A[-3,] means drop the third row from all the columns.
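The indexing rules above can be tried on a small named matrix. The matrix below, filled with 1:9, is an example chosen here for illustration.

```r
A <- matrix(1:9, nrow = 3,
            dimnames = list(c("R-1", "R-2", "R-3"), c("C-1", "C-2", "C-3")))

print(A[, 2:3])       # all rows, columns 2 to 3
print(A[2:3, ])       # rows 2 to 3, all columns
print(A[1, ])         # first row
print(A[2, 3])        # entry in row 2, column 3
print(A[nrow(A), ])   # last row
print(A[, ncol(A)])   # last column

B <- A[-3, ]          # copy of A without the third row
C <- A[, -2]          # copy of A without the second column
print(dim(A))         # A itself is unchanged
```

Negative indices return a reduced copy; to actually delete a row or column from A, the result has to be assigned back, as in A <- A[-3, ].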

Q: Make a sub-matrix in three different ways from a matrix of order 4x5 so that it contains the elements from row 1 to row 3 and from column 2 to column 5.
Q: How to access the entries from row 1 and row 3, then from row 1 to row 3?

Q: How to access the entries from column 2 and column 3?

Q: How to access the entries from row 1 and row 3 but from column 2 and column 5?

Matrix Concatenation
It means merging a row or column into a matrix. The functions are

• rbind(): concatenates a row to a matrix.

• cbind(): concatenates a column to a matrix.

For example
M <- matrix(c(1,2,3,0,1,0,7,7,7), nrow=3, ncol=3, byrow=TRUE)
M1 <- matrix(c(66,67,67), nrow=1, ncol=3, byrow=TRUE)

• > rbind(M,M1)

• > cbind(M,M1)
This gives an error, because the number of rows must match.
M1 is a row matrix, so we need to take its transpose to make it a column matrix, M2 = t(M1), and then use
> cbind(M,M2)

• How to find the inverse of a matrix?
> solve(M) where M is any invertible square matrix
How to check that it is the correct inverse?
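A standard check is to multiply the matrix by its inverse and compare the result with the identity matrix. The sketch below reuses M and M1 from the example above; the names with_row, with_col and M_inv are chosen here for illustration.

```r
M  <- matrix(c(1, 2, 3, 0, 1, 0, 7, 7, 7), nrow = 3, ncol = 3, byrow = TRUE)
M1 <- matrix(c(66, 67, 67), nrow = 1, ncol = 3, byrow = TRUE)

with_row <- rbind(M, M1)      # adds M1 as a fourth row: 4 x 3
with_col <- cbind(M, t(M1))   # adds the transposed M1 as a fourth column: 3 x 4
print(with_row)
print(with_col)

M_inv <- solve(M)             # inverse of M
print(M %*% M_inv)            # should be the identity up to rounding error

# all.equal() compares with a small numerical tolerance
print(all.equal(M %*% M_inv, diag(3)))
```

all.equal() is used instead of == because floating-point rounding usually leaves tiny non-zero values off the diagonal.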

3.4 List in R-language


Data structures are a logical way of representing data as per requirement. They further help depict this logical view physically in computer memory. In the R language, data structures can be classified into two groups, namely homogeneous and heterogeneous.
Homogeneous Data Structures: This type can only store a single type of data (integer, character, etc.).
Heterogeneous Data Structures: This type can store more than one type of data at the same time. R supports two ways of representing heterogeneous data, namely lists and data frames. Both structures are discussed in detail below:
1) Lists: Lists are single-dimensional heterogeneous data types. A list can represent more than one data type at a time, i.e. elements of different types can be stored and their individual types stay intact. We can simply use the list() function to create a list. Lists are similar to vectors; however, vectors are homogeneous and lists are heterogeneous. Another interesting property of lists is that we can store lists inside other lists (like simple recursion). For this reason, lists are also referred to as "Recursive Vectors".

Its syntax: name <- list(element1, element2, element3)

• A list can store several vectors at once.

• Elements of different types can be stored, and their individual types stay intact.

• For example
> li = list(1, "a", TRUE)
The three elements are stored as three components.
> typeof(li[[1]])
[1] "double"
> typeof(li[[2]])
[1] "character"
> typeof(li[[3]])
[1] "logical"

• Let's try to store three vectors by using the list command.
> p = list(c(1, 2, 3), c("a", "b", "c"), c(T, F, T))
Now, how do we extract the elements of that list?
How do we extract "a" from the above list?

• Some other examples are

1. list_ex1 = list(module = "Rlanguage", numbers = 5:1, f1 = FALSE)
2. list_ex2 <- list(list(1, "Rlanguage", FALSE), list("Python", 2, "Language"), list("Hello", FALS
After execution, you can use either of the commands
list_ex2 or str(list_ex2)
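Extraction from a list combines the [[ ]] operator, which returns a component, with ordinary vector indexing. This is a sketch using the list p and list_ex1 from the examples above.

```r
p <- list(c(1, 2, 3), c("a", "b", "c"), c(TRUE, FALSE, TRUE))

second <- p[[2]]      # the second component: the character vector
first_letter <- p[[2]][1]   # its first element
print(second)
print(first_letter)

print(class(p[2]))    # single brackets return a list, not the vector itself

# Named components can also be reached with $
list_ex1 <- list(module = "Rlanguage", numbers = 5:1, f1 = FALSE)
print(list_ex1$numbers)
str(list_ex1)         # compact overview of the whole list
```

The distinction between p[2] (a list of length one) and p[[2]] (the vector inside it) is the most common source of confusion when working with lists.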

3.5 Array
How to create an array of vectors?

• It is a multi-dimensional data structure.

• For example, a vector of 6 elements may be expressed as > vec1 = c(1:6)

• A vector of n equally spaced points from a to b can be created with seq(a, b, length.out = n), where a is the initial number, b is the last number and n is the total number of points. (linspace(a,b,n) is not a base R function.)
> vec2 = seq(a, b, length.out = n)

• A stack of matrices, or a collection of vectors, can be expressed as
array(c(vec1,vec2), dim=c(m,n,p)) where m indicates the number of rows, n the number of columns and p the number of matrices.

• To extract values from it, take for example vec2 = c(7:12) and
> a = array(c(vec1,vec2), dim=c(2,3,2))

• How to extract 7 from it?

a[1,1,2]

• Please extract 10 from it.

a[2,2,2]
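The two extractions can be verified with a concrete array. The choice vec2 = 7, ..., 12 is an assumption made here so that the values 7 and 10 land in the positions used above.

```r
vec1 <- 1:6
vec2 <- seq(7, 12, length.out = 6)   # 7, 8, ..., 12 as equally spaced points

# Two 2 x 3 matrices stacked into one three-dimensional array
a <- array(c(vec1, vec2), dim = c(2, 3, 2))
print(a)

print(a[1, 1, 2])   # row 1, column 1 of the second matrix
print(a[2, 2, 2])   # row 2, column 2 of the second matrix
print(a[, , 1])     # the whole first matrix
```

The array is filled column by column within each matrix, which is why the second matrix starts with 7 in position [1, 1].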

Some of the practice problems in this regard are asked.

Q1. Write an R program to convert a given matrix to a 1-dimensional array.

Q2. Write an R program to create an array of two 3x3 matrices, each with 3 rows and 3 columns, from two given vectors.

Q3. Write an R program to create a 3-dimensional array of 24 elements using the dim() function.

Q4. Write an R program to create an array of two 3x3 matrices, each with 3 rows and 3 columns, from two given vectors.

Q5. Use any 30 natural numbers to write an R program to create an array of four given columns, three given rows, and two given tables, and display the content of the array.

Q6. Use the syntax "seq(from, to, by, length.out, along.with)" to generate a sequence of even numbers greater than 50.

Write an R program to create a two-dimensional 5×3 array of a sequence of even integers greater than 50.

Hint: The arguments of seq() are:

from = the beginning number of the sequence.

to = the terminating number of the sequence.

by = the increment of the sequence. It is calculated as ((to - from) / (length.out - 1)).

length.out = the total length of the sequence.

along.with = outputs a sequence of the same length as the input vector.


Q1. Write a R program to convert a given matrix to a 1 dimensional array.

Solution:

v=1:12

m=matrix(v,3,4)

print(m)

a = as.vector(m)

print("1 dimensional array:")

print(a)

Q2. Write a R program to create an array of two 3x3 matrices each with 3 rows and 3 columns from two
given two vectors.

Solution:

Let's create two vectors

v1 = c(1,3,4,5)

v2 = c(10,11,12,13,14,15)

print(v1)

print(v2)

a1= array(c(v1,v2),dim = c(3,3,2))

print("New array:")

print(a1)

Q3. Write an R program to create a 3-dimensional array of 24 elements using the dim() function.

Solution:

Syntax: sample(x, size, replace)

v = sample(1:5, 24, replace = TRUE) # makes a vector of 24 elements selected randomly from 1 to 5

v

dim(v) = c(3,2,4) # sets the dimensions of v so that it becomes 4 matrices of order 3 x 2

print(v)

Q4. Write a R program to create an array of two 3x3 matrices each with 3 rows and 3 columns from two
given two vectors.

Print the array. Then print the second row of the second matrix of the array and the element in the 3rd
row and 3rd

column of the 1st matrix.

Solution:

v1=c(1:5)

v2=c(12,111,10)

a=array(c(v1,v2),dim=c(3,3,2))

a[2,,2] # the 1st index indicates the row position, the 2nd the column, and the 3rd the matrix

a[3,3,1] # gives the entry in the 3rd row and 3rd column of the 1st matrix

Q5. Use any 30 natural numbers to write a R program to create an array of four given columns,three
given rows, and two given tables and display the content of the array.

Solution:

v=c(1:30)

a = array(v, dim = c(3,4,2)) or a = array(1:30, dim = c(3,4,2)) # three rows, four columns, two tables; only the first 24 values are used

Q6. Use the syntax "seq(from, to, by, length.out, along.with)" to generate sequence of even numbers
greater than 50.

Write a R program to create a two-dimensional 5×3 array of sequence of even integers greater than 50.

Solution:

a = seq(from = 52, by = 2, length.out = 15)

array(a, dim = c(5, 3))

or

a = array(seq(from = 52, length.out = 15, by = 2), c(5, 3))

a
3.6 Factor
The factor is the next data structure. It is a very important tool in machine learning (ML), because many ML models need numerical data instead of characters.
Factors are used to represent categorical data. Factors are data objects which are used to categorize the data and store it as levels. They can store both strings and integers. They are useful in columns that have a limited number of unique values, like "Male", "Female" and True, False, etc. They are useful in data analysis for statistical modeling.
Factors can be ordered or unordered and represent an important class for statistical analysis and also for plotting. Factors are stored as integers and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, a factor can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order.
Factors are created using the factor() function by taking a vector as input.

• data = c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East", "North")
print(data)
print(is.factor(data))
Note: The is.factor() function in R is used to check whether the object passed to it is a factor. It returns a logical value.
Let's try this

• factor_data = factor(data)
print(factor_data)
print(is.factor(factor_data))

• Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with the new order of the levels.
data = c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East", "North")
factor_data = factor(data)
print(factor_data)
Applying the factor function again gives the new order of the levels.
new_order_data = factor(factor_data, levels = c("East", "West", "North"))
print(new_order_data)

• Generating Factor Levels
We can generate factor levels by using the gl() function. It takes two integers as input, which indicate how many levels there are and how many times each level is repeated.
gl(n, k, labels)
n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.
For example
v = gl(3, 4, labels = c("Lahore", "Islamabad", "Kasur"))
print(v)

The result is a factor with 3 levels. Let's understand factors with the help of some more examples.

• Suppose that you have a variable that records months:
x1 = c("Dec", "Apr", "Jan", "Mar")

• Using a string to record this variable has two problems. First, nothing stops you from making typos:
x2 = c("Dec", "Apr", "Jam", "Mar")

• Second, it does not sort in a useful way:
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

• Both of these problems can be fixed with a factor. To create a factor, you must start by creating a list of the valid levels:
M_levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

• Now you can create a factor:
y1 = factor(x1, levels = M_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

• Any values not in the set of levels will be silently converted to NA. For example
y2 = factor(x2, levels = M_levels)
y2
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

• If you want a warning, you can use readr::parse_factor():

• If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

• Sometimes one would prefer that the order of the levels match the order of first appearance in the data. You can do that when creating the factor by setting levels to unique(x):
f1 = factor(x1, levels = unique(x1))
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar

• If you ever need to access the set of valid levels directly, you can do so with levels():
levels(f1)
[1] "Dec" "Apr" "Jan" "Mar"

• Another example:
The factor() command is used to create and modify factors in R:
> color <- c("red", "blue", "green")
> factor(color)
[1] red blue green
Levels: blue green red
Note that the levels are sorted alphabetically.
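Because factors are stored as integer codes, converting them needs some care. The sketch below uses example vectors chosen here for illustration.

```r
color <- c("red", "blue", "green", "red")
f <- factor(color)

print(levels(f))         # the labels, sorted alphabetically
print(as.integer(f))     # the integer codes that are actually stored
print(as.character(f))   # back to the original strings

# Pitfall: a factor whose labels look like numbers
sizes <- factor(c("10", "5", "20"))
print(as.integer(sizes))                 # the codes, not the numbers
print(as.numeric(as.character(sizes)))   # the numbers themselves
```

Converting through as.character() first is the usual way to recover numeric values from a factor.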

Note: To remove the variables from the environment window, one can use the
command of rm() or to clear the list from the environment window, you can use
rm(list=ls()).

3.7 Data Frame


It is a two-dimensional heterogeneous data structure (its columns can have different data types). Examples are an Excel sheet, a tabular arrangement, a csv file, etc. Let's create two data frames.

• fruit_name = c("Apple", "Banana", "Guava")
fruit_cost = c(10, 20, 30)
fruits = data.frame(fruit_name, fruit_cost)
The data is saved in the variable fruits.

• To extract the vectors, use the commands

• fruits$fruit_name (it contains the first column).

• fruits$fruit_cost (it contains the second column).

• To view the complete table, just give the command
> fruits
• Another example is
height = c(132, 151, 162, 139, 166, 147, 122)
weight = c(48, 49, 66, 53, 67, 52, 40)
gender = c("male", "male", "female", "female", "male", "female", "male")
input_data = data.frame(height, weight, gender)
print(input_data)
print(is.factor(input_data$gender)) # tests whether the gender column is a factor; in R 4.0.0 and later this is FALSE unless data.frame() is called with stringsAsFactors = TRUE
print(input_data$gender) # prints the gender column; its levels are shown when it is a factor
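Whether the gender column becomes a factor depends on the stringsAsFactors argument of data.frame(). The following sketch builds the data frame both ways; the names df_chr and df_fac are chosen here for illustration.

```r
height <- c(132, 151, 162, 139, 166, 147, 122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male", "male", "female", "female", "male", "female", "male")

df_chr <- data.frame(height, weight, gender)   # default behaviour
df_fac <- data.frame(height, weight, gender, stringsAsFactors = TRUE)

print(is.factor(df_chr$gender))   # FALSE in R 4.0.0 and later
print(is.factor(df_fac$gender))   # TRUE
print(levels(df_fac$gender))      # the categories of the factor column
str(df_fac)                       # structure of the data frame
```

An existing character column can also be converted afterwards with df_chr$gender <- factor(df_chr$gender).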

Note: Apply the command of data frame to show the tables from Q2.20 to Q2.29 from
the recommended book of Statistics.

3.7.1 Sum of values of vectors

• Let's create a vector having numerical values x = c(123, 54, 23, 876, NA, 134, 2346, NA)
The command sum(x, na.rm = TRUE) removes the NA values and calculates the sum; otherwise the answer is NA.
Note: The argument na.rm gives a simple way of removing missing values from data if they are coded as NA. In base R its default value is FALSE, meaning NAs are not removed.
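The effect of na.rm can be seen by computing the sum both ways; the variable names total_default and total_clean are chosen here for illustration.

```r
x <- c(123, 54, 23, 876, NA, 134, 2346, NA)

total_default <- sum(x)                # NA: a missing value makes the sum unknown
total_clean   <- sum(x, na.rm = TRUE)  # 3556: the NA values are dropped first

print(total_default)
print(total_clean)
print(mean(x, na.rm = TRUE))           # na.rm works the same way in mean()
print(sum(is.na(x)))                   # number of missing values
```

The same argument is available in many other summary functions, such as median(), min(), max() and var().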

3.7.2 How to construct the Probability Histogram

Make the histogram of the following examples.

3.8 Inbuilt Functions in R

To view all the available datasets use the data() function, it will display all the
datasets available with R installation.

Let’s do some work with the built-in data frame named iris. Numerous guides have been
written on the exploration of this widely known dataset. Iris, introduced by Ronald
Fisher in his 1936 paper, The use of multiple measurements in taxonomic problems,
contains three plant species (setosa, virginica, versicolor) and four features
measured for each sample. These quantify the morphologic variation of the iris
flower in its three species, with all measurements given in centimeters.
It can be loaded and viewed by the commands

– data(iris)
– View(iris)
The iris dataset is a built-in data-set in R that contains measurements of 4
different attributes (in centimeters) for 150 flowers from 3 different species, i.e.
"setosa", "versicolor", and "virginica".

– class(iris)
[1] "data.frame"
– str(iris)
This command is used to view the structure of the data frame of iris.

– To get the dimensions of the data-set in terms of the number of rows and the
number of columns, we can use the dim() function.
> dim(iris)
[1] 150 5
The data set contains 150 rows and 5 columns.
– To display the column names of the data frame, we can use the names() function.
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

– setosa: This species occurs 50 times.


– versicolor: This species occurs 50 times.
– virginica: This species occurs 50 times.

– head(iris)
It is used to get the first six records.

Figure 11: First six records of data set IRIS

– head(iris,n)
It is used to get the first n records.
– tail(iris)
It is used to get the last six records.

– How to retrieve the columns of data frame iris?
For that, use commands such as
> iris$Sepal.Length
and similarly for the other columns. Note that column names are case-sensitive.
– table()
The table() function in R can be used to quickly create frequency tables.
Let’s use this command on each of the columns.
– table(iris$Species)
It provides the frequency table, i.e. the frequency of each level of Species.
– To quickly summarize each variable in the data-set, we can use the summary()
function.

Figure 12: summary of data set IRIS

– The following code shows how to use prop.table() to create a frequency table
of proportions for the Species variable in our data frame:
prop.table(table(iris$Species))

– To construct the frequency table for two variables of the data set iris, use
the code (to restrict it to the first six records, replace iris by head(iris))
table(iris$Sepal.Length, iris$Sepal.Width)
– To construct the frequency table of proportions for two variables of the data
set iris, use the code
prop.table(table(iris$Sepal.Length, iris$Sepal.Width))
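The inspection commands listed above can be collected into one short script. This is only a sketch; the names species_counts and species_props are illustrative and not part of the iris data set.

```r
data(iris)

print(dim(iris))       # number of rows and columns
print(names(iris))     # column names
print(head(iris, 3))   # first three records

species_counts <- table(iris$Species)        # absolute frequencies
species_props <- prop.table(species_counts)  # relative frequencies (sum to 1)

print(species_counts)
print(species_props)
```

Since each species occurs 50 times out of 150, every proportion equals one third.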

Visualize the Iris Dataset

One of the graphical representations is the histogram. To create a histogram of the
values for a certain variable, we can use the hist() function.
Let’s create the histogram for each column.
For example, to create the histogram of values for sepal length:
hist(iris$Sepal.Length, col = 'steelblue', main = 'Histogram', xlab = 'Sepal Length', ylab = 'Frequency')
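To draw a histogram for each numeric column in one go, a loop over the column names can be used. The sketch below is one possible way of doing this; it assumes a 2 x 2 layout is wanted and skips the non-numeric Species column.

```r
data(iris)

# Names of the numeric columns (Species is a factor and is skipped)
numeric_cols <- names(iris)[sapply(iris, is.numeric)]

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots; remember old settings
for (col_name in numeric_cols) {
  hist(iris[[col_name]],
       col = "steelblue",
       main = paste("Histogram of", col_name),
       xlab = col_name,
       ylab = "Frequency")
}
par(old_par)                      # restore the previous layout
```

Here iris[[col_name]] is used instead of the $ operator because the column name is stored in a variable.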

Figure 13: Histogram of column Sepal length of data set IRIS

To create a scatterplot of any pairwise combination of variables, we can use
the plot() function.
For a scatterplot of sepal width vs. sepal length:
plot(iris$Sepal.Width, iris$Sepal.Length, col = 'steelblue', main = 'Scatterplot', xlab = 'Sepal Width', ylab = 'Sepal Length', pch = 19)

Figure 14: Scatter diagram columns Sepal Width vs Sepal Length of data set IRIS

Q: Repeat the above process for the data sets airquality, cars, iris3, quakes, etc.
To create a boxplot by group, we can use the boxplot() function.
For a boxplot of Sepal.Length by Species, use the command
boxplot(Sepal.Length ~ Species, data = iris, main = 'Sepal Length by Species', xlab = 'Species', ylab = 'Sepal Length', col = 'steelblue', border = 'black')

Figure 15: Boxplot of Sepal Length vs Species of data set IRIS

– The x-axis displays the three species and the y-axis displays the distribution
of values for sepal length for each species.
– This type of plot allows us to quickly see that the sepal length tends to be
largest for the virginica species and smallest for the setosa species.
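The visual impression from the boxplot can be checked numerically by computing the median sepal length per species, which corresponds to the thick line inside each box. This is a small sketch using tapply(); the name medians is illustrative.

```r
data(iris)

# Median sepal length within each species
medians <- tapply(iris$Sepal.Length, iris$Species, median)
print(medians)

# Species with the smallest and the largest median
print(names(which.min(medians)))
print(names(which.max(medians)))
```

Replacing median by mean, min or max in the tapply() call gives the corresponding group-wise summaries.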

3.9 How to import the data

The most common file types that we need to import into R are

– Excel file
– CSV file: a comma-separated values file. It is a delimited text
file that uses a comma to separate values. Each line of the file is a data record.
– Tab-delimited text: a file in which tabs separate the fields, with
one record per line. A tab-delimited file is often used to upload data to a
system. The most common program used to create these files is Microsoft
Excel.
Some examples of importing files are
– x = read_excel("D:/IME-MA-240+L-ProbandStat/R-Language/csvorexcelData/LungCa
– y = read_excel("D:/IME-MA-240+L-ProbandStat/R-Language/csvorexcelData/smoker
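Note that read_excel() is not part of base R; it comes from the readxl package, which has to be installed and loaded with library(readxl) first. CSV and tab-delimited files, on the other hand, can be read with base R alone. The sketch below illustrates this with a small hypothetical data frame written to a temporary file, so the file name and column names are assumptions made only for the illustration.

```r
# Hypothetical example data written to a temporary CSV file
example_data <- data.frame(id = 1:3, smoker = c("yes", "no", "yes"))
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Import the comma-separated file back into R
imported <- read.csv(csv_path)
str(imported)

# Tab-delimited files can be read the same way with read.delim(path).
# Excel files need an add-on package, e.g. readxl::read_excel(path).
unlink(csv_path)   # remove the temporary file
```

When reading a real file, csv_path is replaced by the full path of the file, written with forward slashes as in the examples above.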

Some of the examples

– To load the data set cars and view its structure, use > data(cars) > str(cars)

– To view the data, use > View(cars)
– To work on the columns of the data set cars, use
> cars$speed
> cars$dist

– To find the measures of central tendency, one can use


> mean(cars$speed)
> mean(cars$speed, trim=0.05)
> sort(cars$speed)
> mean(cars$dist)
> median(cars$speed)

– To find the measures of variation, one can use


> summary(cars$speed)

> fivenum(cars$speed)
> quantile(cars$speed)
> quantile(cars$speed,0.5)
> quantile(cars$speed,0.75)
> quantile(cars$speed,0.95)
> IQR(cars$speed)
> range(cars$speed)
> min(cars$speed)
> max(cars$speed)
> var(cars$speed)
> sd(cars$speed)
> sqrt(var(cars$speed))

– > data() You will get a list of the available data sets, one of which is
cars. Students, please view it.
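Several of the summary commands above are related to each other, and these relations can be verified directly on the speed column. The following is a minimal sketch; the names speed and q are illustrative.

```r
data(cars)
speed <- cars$speed

# Quartiles of the speed column
q <- quantile(speed, c(0.25, 0.5, 0.75))
print(q)

# The interquartile range is the third quartile minus the first quartile
print(IQR(speed))
print(unname(q[3] - q[1]))

# The standard deviation is the square root of the variance
print(all.equal(sd(speed), sqrt(var(speed))))

# The 50% quantile is the median
print(all.equal(unname(q[2]), median(speed)))
```

In the same way, range(speed) returns the pair min(speed) and max(speed).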

3.10 Linear regression

Linear regression is a regression model that uses a straight line to describe the
relationship between variables. It finds the line of best fit through your data by
searching for the value of the regression coefficient(s) that minimizes the total er-
ror of the model.

There are two main types of linear regression:


Simple linear regression uses only one independent variable.
Multiple linear regression uses two or more independent variables.
Let’s now create two variables and see how to plot them and include a regression
line. We take the different loads applied to a spring as the variable X, which is
useful to measure the stiffness of the spring, and the resulting length of the
spring as the variable Y.

X 3 5 6 9 10 12 15 20 22 28
Y 10 12 15 18 20 22 27 30 32 34

First, create these two variables in the R-console window or in the script file.
We can now create a simple plot of the two variables as follows:
plot(X, Y)
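The two variables can be entered as numeric vectors with c(), using the values from the table above, for example:

```r
# Loads (X) and corresponding spring lengths (Y) from the table
X <- c(3, 5, 6, 9, 10, 12, 15, 20, 22, 28)
Y <- c(10, 12, 15, 18, 20, 22, 27, 30, 32, 34)

# Both vectors must have the same number of observations
stopifnot(length(X) == length(Y))

plot(X, Y)   # simple scatter plot of Y against X
```

The call to stopifnot() is only a safety check against typing errors and can be omitted.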

Figure 16: Scatter Plot

We can improve this plot by using various arguments within the plot() command.
Copy and paste the following code into the R workspace:
plot(X, Y, pch = 16, cex = 1.3, col = "blue", main = "Testing of experimental data", xlab = "Loads", ylab = "Length")

Figure 17: Improved Scatter Plot

In the above code, the syntax pch = 16 creates solid dots, while cex = 1.3 creates
dots that are 1.3 times bigger than the default (where cex = 1). More about these
commands later.
Now let’s perform a linear regression using lm() on the two variables by adding
the following text at the command line: lm(Y ~ X)

Figure 18: Results of Regression Line

Note that the intercept is 8.804 and the slope is 1.015. By the way, lm stands
for “linear model”.
We can add a best-fit line or regression line to our plot by adding the following
text at the command line:
abline(y-intercept, slope)
abline(8.804, 1.015)
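Instead of typing the coefficients by hand, they can be taken from the fitted model. The sketch below stores the model in an object, extracts the intercept and slope with coef(), and passes the model directly to abline(). The prediction for a load of 18 is only an illustrative value that does not appear in the data.

```r
X <- c(3, 5, 6, 9, 10, 12, 15, 20, 22, 28)
Y <- c(10, 12, 15, 18, 20, 22, 27, 30, 32, 34)

model <- lm(Y ~ X)           # fit Y = intercept + slope * X
coefs <- coef(model)
print(round(coefs, 3))       # intercept and slope, rounded to 3 decimals

plot(X, Y, pch = 16, col = "blue", xlab = "Loads", ylab = "Length")
abline(model, col = "red")   # regression line taken from the fitted model

# Predicted length for a load of 18 (illustrative value)
print(predict(model, newdata = data.frame(X = 18)))
```

Passing the model object to abline() avoids rounding errors that arise when the coefficients are copied manually.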

Figure 19: Graph of Regression Line

We can also use the syntax abline(lm(Y ~ X)) to plot the regression line.
Now we can use several R diagnostic plots and influence statistics to diagnose how
well our model is fitting the data. These diagnostic plots include:

– Residuals vs. fitted values


– Q-Q plots
– Scale Location plots
– Cook’s distance plots.

To use R’s regression diagnostic plots, we set up the regression model as an object
and create a plotting environment of two rows and two columns. Then we use the
plot() command, treating the model as an argument.

model = lm(Y ~ X)
par(mfrow = c(2,2))
par(bg = gray(.9))  # Optional command to create a light gray background
plot(model)
Note: par(mfrow) is useful to arrange multiple plots in the same plotting space.
par(mfrow = c(2,2)) means multiple plots with two rows and two columns. The
mfrow and mfcol parameters allow you to create a matrix of plots in one plotting
space.
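The quantities shown in the diagnostic plots can also be inspected numerically. The following sketch, which assumes the spring data X and Y from above, extracts the fitted values, the residuals and Cook's distances from the model object.

```r
X <- c(3, 5, 6, 9, 10, 12, 15, 20, 22, 28)
Y <- c(10, 12, 15, 18, 20, 22, 27, 30, 32, 34)
model <- lm(Y ~ X)

fit <- fitted(model)          # values on the regression line
res <- residuals(model)       # observed minus fitted values
cd <- cooks.distance(model)   # influence of each observation

print(round(cbind(fitted = fit, residual = res, cooks = cd), 3))
print(sum(res))                   # numerically zero for a model with intercept
print(summary(model)$r.squared)   # proportion of variance explained
```

Observations with a comparatively large Cook's distance are the ones that stand out in the fourth diagnostic plot.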
In the next step, we will walk you through linear regression in R by using a sample
dataset named income.data. Some screenshots showing how to import the file are given below.

Figure 20: How to import the csv file

Figure 21: How to import the csv file

Figure 22: How to import the csv file

Figure 23: How to import the csv file

Q. Draw the scatter plot, find the regression coefficients, and fit a line (an
approximation) to the above data to explore the relationship between the variables
income and happiness. Also, find the correlation coefficient between these
variables by using the command cor(A, B) and then interpret it. Also use the commands
model = lm(Y ~ X)
par(mfrow = c(2,2))
plot(model)
to get the complete layout as

Figure 24: Layout
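The use of cor() together with lm() can be tried out before the real file is imported. The sketch below does not use the income.data file; it generates hypothetical income and happiness values with a fixed random seed, so the numbers and the assumed linear relation are for illustration only.

```r
set.seed(1)

# Hypothetical data: 50 incomes and a noisy linear response
income <- runif(50, min = 1, max = 8)
happiness <- 0.7 * income + rnorm(50, sd = 0.7)

r <- cor(income, happiness)       # correlation coefficient, between -1 and 1
model <- lm(happiness ~ income)   # simple linear regression

print(r)
print(coef(model))

# In simple linear regression, the squared correlation equals R-squared
print(all.equal(r^2, summary(model)$r.squared))
```

A value of r close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.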
