R Manual
6. Click on Run
7. Choose language of your choice
8. Click on Next
9. Select the appropriate folder to install and click next
10. Check for options and then click Next
11. Use default settings while installing
12. Click next to install
R PROGRAMMING
R is an interpreted computer programming language which was created by Ross Ihaka and
Robert Gentleman at the University of Auckland, New Zealand.
“The R Development Core Team” currently develops R. It is also a software environment used for
statistical analysis, graphical representation, reporting, and data modeling.
The name of the language is taken from the first letter of the first names of both developers. The
project was first conceived in 1992. The initial version was released in 1995, and in 2000, a stable
beta version was released.
Version-Release   Date         Description
0.49              1997-04-23   First time R's source was released, and CRAN (Comprehensive R Archive Network) was started.
2.13              2011-04-14   Added a function that rapidly converts code to byte code.
Features of R programming
R is a domain-specific programming language which aims to do data analysis. It has some
unique features which make it very powerful, the most important arguably being its notion of
vectors. Vectors allow us to perform a complex operation on a set of values in a single
command. R programming has the following features:
1. It is a simple and effective programming language which has been well developed.
2. It is data analysis software.
3. It is a well-designed, easy, and effective language which supports user-defined functions,
loops, conditionals, and various I/O facilities.
4. It has a consistent and integrated set of tools which are used for data analysis.
5. For different types of calculation on arrays, lists and vectors, R has a suite of operators.
6. It provides effective data handling and storage facility.
7. It is open-source, powerful, and highly extensible software.
8. It provides highly extensible graphical techniques.
9. It allows us to perform multiple calculations using vectors.
10. R is an interpreted language.
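The vector feature mentioned above can be illustrated with a short sketch. The variable names and values here are hypothetical and chosen only for illustration.

```r
# Hypothetical sample data: prices of four items
prices <- c(10, 20, 30, 40)

# One command applies the operation to every element
discounted <- prices * 0.9
print(discounted)

# Arithmetic between two vectors is also performed element by element
quantities <- c(1, 2, 3, 4)
totals <- prices * quantities
print(totals)
```

No explicit loop is needed in either case, which is what makes vector operations convenient for data analysis.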
Comparison between R and Python
A variable is a named memory location where data is stored. Variables are used to store the
information to be manipulated and referenced in the R program.
Data Types in R Programming Language
Each variable in R has an associated data type. Each R data type requires a different amount of
memory and has specific operations which can be performed on it.
Decimal values are called numerics in R. Numeric is the default R data type for numbers. If you
assign a decimal value to a variable x as follows, x will be of numeric type.
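A minimal sketch of such an assignment follows. The value 5.6 is an assumption chosen for illustration; any decimal value behaves the same way.

```r
# Assign a decimal value to x (5.6 is an illustrative value)
x = 5.6
# class() reports the data type, typeof() the internal storage type
print(class(x))
print(typeof(x))
```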
OUTPUT
[1] "numeric"
[1] "double"
Note: When R stores a number in a variable, it stores it as a double-precision floating-point
value by default. This means that a value such as 5 is stored as a double with a class of
numeric, not as an integer.
------------------------------------------------------------------------------------------------------------------
> y = 5
> print(class(y))
# is y an integer?
> print(is.integer(y))
OUTPUT
[1] "numeric"
[1] FALSE
------------------------------------------------------------------------------------------------------------------
R supports integer data types which are the set of all integers. The capital ‘L’ notation is used
as a suffix to denote that a particular value is of the integer R data type.
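A minimal sketch of the L suffix follows. The value 4 is an assumption chosen for illustration.

```r
# The L suffix makes 4 an integer instead of a double (4 is an illustrative value)
x = 4L
print(class(x))
print(typeof(x))
```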
OUTPUT
[1] "integer"
[1] "integer"
------------------------------------------------------------------------------------------------------------------
You can create as well as convert a value into an integer type using the as.integer() function.
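A minimal sketch of both uses of as.integer( ) follows. The values 5 and 5.9 are assumptions chosen for illustration.

```r
# Create an integer directly from a whole number
x = as.integer(5)
print(class(x))
# Convert an existing numeric value; the fractional part is dropped
y = as.integer(5.9)
print(typeof(y))
```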
OUTPUT
[1] "integer"
[1] "integer"
------------------------------------------------------------------------------------------------------------------
Logical Data type
R has a logical data type that takes a value of either TRUE or FALSE. A logical value is often
created via a comparison between variables.
# Sample values
> x = 4
> y = 3
> z = x > y
> print(class(z))
> print(typeof(z))
OUTPUT
[1] "logical"
[1] "logical"
------------------------------------------------------------------------------------------------------------------
Complex Data type
R supports the complex data type, which covers the set of all complex numbers. The complex data
type is used to store numbers with an imaginary component.
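A minimal sketch of a complex value follows. The value 4 + 3i is an assumption chosen for illustration.

```r
# The i suffix marks the imaginary part (4 + 3i is an illustrative value)
x = 4 + 3i
print(class(x))
print(typeof(x))
```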
OUTPUT
[1] "complex"
[1] "complex"
------------------------------------------------------------------------------------------------------------------
Character Data type
R supports the character data type, which stores character values or strings. Strings in R can
contain letters, numbers, and symbols.
The easiest way to denote that a value is of character type in R is to wrap the value
inside single or double quotes.
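A minimal sketch of a character value follows. The variable name char and the string are assumptions chosen for illustration.

```r
# Wrapping the value in quotes makes it a character string
char = "R Programming"
print(class(char))
print(typeof(char))
```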
OUTPUT
[1] "character"
[1] "character"
------------------------------------------------------------------------------------------------------------------
Find data type of an object
To find the data type of an object, use the class( ) function. Pass the object as an argument
to class( ) and it returns the data type of the object.
Syntax:
class(object)
# Logical
> print(class(TRUE))
# Integer
> print(class(3L))
# Numeric
> print(class(10.5))
# Complex
> print(class(1+2i))
# Character
> print(class("01-01-2024"))
OUTPUT
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "complex"
[1] "character"
------------------------------------------------------------------------------------------------------------------
Type verification
To verify the type of an object, use a function whose name is the data type with the prefix “is.”.
The syntax is is.datatype(object), where object is the value you have to verify.
# Logical
> print(is.logical(TRUE))
# Integer
> print(is.integer(3L))
# Numeric
> print(is.numeric(10.5))
# Complex
> print(is.complex(1+2i))
# Character
> print(is.character("01-01-2024"))
> print(is.integer("a"))
> print(is.numeric(2+3i))
OUTPUT
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
------------------------------------------------------------------------------------------------------------------
Type Conversion
You can convert from one type to another with the following functions:
as.numeric( )
as.integer ( )
as.complex ( )
Example Program
x <- 1L # integer
y <- 2 # numeric
# convert from integer to numeric:
a <- as.numeric(x)
# convert from numeric to integer:
b <- as.integer(y)
# print values of x and y
x
y
# print the class name of a and b
class(a)
class(b)
Output
[1] 1
[1] 2
[1] "numeric"
[1] "integer"
WEEK-2
1. To output text in R, use single or double quotes:
"Hello World!"
'Hello ……'
5
10
25
5+5
R Print Output:
Unlike many other programming languages, you can output values in R without using a print
function:
"Hello World!"
However, there are times you must use the print( ) function to output values, for example when
working with for loops:
for (x in 1:10)
{
print(x)
}
R Variables
Variables are containers for storing data values. R does not have a command for declaring a
variable. A variable is created the moment you first assign a value to it.
In other programming languages, it is common to use = as an assignment operator.
In R, we can use both = and <- as assignment operators; <- is the conventional choice.
To assign a value to a variable, use the <- sign.
PROGRAM
name <- "John"
age <- 40
print(name)
print(age)
OUTPUT
[1] "John"
[1] 40
R Operators
Operators are used to perform operations on variables and values. An operator is a symbol
which tells the interpreter to perform a specific logical or mathematical manipulation.
R is very rich in built-in operators. R divides the operators into the following
groups:
Arithmetic operators
Assignment operators
Comparison operators
Logical operators
Miscellaneous operators
Arithmetic Operators
Assignment Operators
1. <- or = or <<-
These operators are known as left assignment operators.
a <- c(3, 0, TRUE, 2+2i)
b <<- c(2, 4, TRUE, 3)
d = c(1, 2, FALSE, 7)
print(a)
print(b)
print(d)
Logical Operators
2. | (element-wise logical OR)
This operator compares each element of the first vector with the corresponding element of the
second vector and returns TRUE wherever at least one of them is TRUE.
a <- c(3, 0, TRUE, 2)
b <- c(2, 4, TRUE, 3)
print(a|b)
4. && (logical AND)
This operator works on single values and gives TRUE as a result only if both are TRUE. From
R 4.3.0 onwards, passing a vector longer than one element is an error, so the first element of
each vector is selected explicitly.
a <- c(3, 0, TRUE, 2)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] && b[1])
5. || (logical OR)
This operator works on single values and gives the result TRUE if at least one of them is TRUE.
As with &&, the first element of each vector is selected explicitly.
a <- c(3, 0, FALSE, 2)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] || b[1])
Miscellaneous Operators
1. :
The colon operator is used to create a series of numbers in sequence for a vector.
v <- 1:8
print(v)
2. %in%
This is used when we want to identify if an element belongs to a vector.
a1 <- 8
a2 <- 12
d <- 1:10
print(a1 %in% d)
print(a2 %in% d)
b) Implement R script to read person's age from keyboard and display whether he is
eligible for voting or not
Program:1
{
age <- as.integer(readline(prompt = "Enter your age :"))
if (age >= 18){
print(paste("You are valid for voting :", age))
} else{
print(paste("You are not valid for voting :", age))
}
}
The readline( ) function reads a single line of text typed by the user at the console and
returns it as a character string, which is why the input is converted with as.integer( ) above.
The paste( ) function converts its arguments to character strings and concatenates them,
separated by a space by default.
WEEK-3
List:
1. List of strings
2. Access Lists
You can access the list items by referring to its index number, inside brackets [ ]
3. List Length
To find out how many items a list has, use the length( ) function:
To add an item to the end of the list, use the append( ) function:
x <- list("apple", "banana", "cherry")
x
z = append(x, "orange")
z
The most common way to join two lists is to use the c( ) function, which combines their elements:
Access components by names: All the components of a list can be named, and we can use
those names to access the components of the list with the $ operator.
Access components by indices: We can also access the components of a list using
indices. To access a top-level component of a list we use the double bracket
operator “[[ ]]”, and to access an inner-level element of that component we add a single
square bracket “[ ]” after the double bracket operator, as in EmpList[[2]][2].
The cat( ) function displays the concatenated text on the console, but the result is not
saved in a variable. The paste( ) function returns the concatenated string as a character
value, which can be saved in a variable.
Program: 1
EmpId = c(1, 2, 3, 4)
EmpName = c("Raju", "Sandeep", "Subham", "Rani")
NumberofEmp = 4
EmpList=list("ID" = EmpId,"Names" = EmpName,"TotalStaff" =NumberofEmp)
print(EmpList)
cat("Accessing name components using $ command\n")
print(EmpList$Names)
print(EmpList$ID)
print(EmpList$TotalStaff)
Program: 2
# Accessing a top-level component by index
cat("Accessing name components using indices\n")
print(EmpList[[2]])
# Accessing an inner-level element by index
cat("Accessing Sandeep from name using indices\n")
print(EmpList[[2]][2])
# Accessing another inner-level element by index
cat("Accessing 4 from ID using indices\n")
print(EmpList[[1]][4])
R provides two built-in functions named c( ) and append( ) to combine two or more
lists.
Method 1: Using c( ) function
The c( ) function accepts two or more lists as parameters and returns another list with
the elements of all the lists.
Syntax: c(list1, list2)
Method 2: Using append( ) function
The append( ) function accepts two lists as parameters and returns another list
with the elements of the second list added after those of the first.
Syntax: append(list1, list2)
Program:
List1 <- list(1:5)
List1
List2 <- list(6:10)
List2
List3 = append(List1, List2)
print(List3)
Program:
EmpId = c(1, 2, 3, 4)
EmpName = c("Raju", "Sandeep", "Subham", "Rani")
NumberofEmp = 4
EmpList=list("ID" = EmpId,"Names" = EmpName,"TotalStaff"=NumberofEmp)
print("BEFORE MERGING TWO LISTS")
print(EmpList)
Empage=c(35,45,55,25)
EmpageList=list("AGE"=Empage)
Emp=c(EmpList,EmpageList)
print("AFTER MERGING TWO LISTS")
print(Emp)
Matrices
Operations on Matrices
There are four basic operations, i.e. DMAS (Division, Multiplication, Addition, Subtraction),
that can be done with matrices. These operations are performed element by element, so both
matrices involved in the operation should have the same number of rows and columns.
Program:
B = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
C = matrix(c(7, 8, 9, 10, 11, 12), nrow = 2, ncol = 3)
print(B)
print(C)
print(B + C)
print(B-C)
print(B*C)
print(B/C)
print(t(B))  # transpose of B
WEEK- 4
Implement R Script to perform various operations on Vectors
Vectors:
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c( ) function and separate the items by a comma.
Example:
# Vector of strings
fruits <- c("banana", "apple", "orange")
Vector Length
To find out how many items a vector has, use the length( ) function:
Example:
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use the sort( ) function:
Example:
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
sort(fruits)
sort(numbers)
Access Vectors
You can access the vector items by referring to its index number inside brackets [ ]
Example:
# Access the element using position number
X<-c(7,2,5,6,1,9)
X[3]
# Access the 1st and 3rd item (banana and orange)
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[c(1, 3)]
# Logical indexing
Z=c(7,8,1,2,6,9)
Z[Z>3]
Change an Item
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change "banana" to "pear"
fruits[1] <- "pear"
fruits
Repeat Vectors
To repeat vectors, use the rep( ) function:
1) Repeat each value
Example: x <- rep(c(1,2,3), each = 3)
x
2) Repeat the sequence of the vector:
Example: x<- rep(c(1,2,3), times = 3)
x
3) Repeat each value independently:
Example: x <- rep(c(1,2,3), times = c(5,2,1))
x
Arithmetic operations
The arithmetic operations are performed member-by-member on vectors. We can add, subtract,
multiply, or divide two vectors.
Example: a<-c(1,3,5,7)
b<-c(2,4,6,8)
x=a+b
y=a-b
z=a/b
w=a*b
x
y
z
w
Deleting an R vector
Deleting a vector is the process of removing all of its elements. This can be done
by assigning NULL to it.
Example:
# Creating a Vector
M<- c(8, 10, 2, 5)
# set NULL to the vector
M<- NULL
cat('Output vector', M)
Find the Sum, Mean and Product of a vector in R
The sum( ), mean( ), and prod( ) functions are available in R to compute the sum, mean, and
product of their arguments. When a single vector is passed, the operation is performed over
all of its elements, which is equivalent to accumulating the result in a for loop.
Example:
vec = c(1, 2, 3 , 4)
print("Sum of the vector:")
print(sum(vec))
print("Mean of the vector:")
print(mean(vec))
print("Product of the vector:")
print(prod(vec))
Example:
vec = c(1.1,NA, 2, 3.0,NA )
print("Sum of the vector:")
print(sum(vec,na.rm = TRUE))
print("Mean of the vector with NA values:")
print(mean(vec))
print("Mean of the vector without NA values:")
print(mean(vec,na.rm = TRUE))
print("Product of the vector:")
print(prod(vec,na.rm = TRUE))
How to Find Min and Max Values Using the Range Function
The range( ) function in R returns a vector that contains the minimum and maximum values of
the given argument, which together describe the range of the data.
Example: x=c(-10,-15,5,19,27,0)
range(x)
Example: x=c(-10,-15,5,NA,19,27,NA,0)
range(x,na.rm=TRUE)
Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array( ) function to create an array, and the dim parameter to specify the
dimensions.
How does dim=c(4,3,2) work?
The first and second numbers in the bracket specify the number of rows and columns.
The last number in the bracket specifies how many matrices of that size we want.
Example: x <- c(1:24)
y<- array(x, dim=c(4,3,2))
y
Access Array Items
You can access the array elements by referring to the index position.
You can use the [ ] brackets to access the desired elements from an array
The syntax is as follow: array[row position, column position, matrix level]
Example: x <- c(1:24)
y <- array(x, dim = c(4, 3, 2))
y[2, 3, 2]
dimnames = list(row.names,column.names,matrix.names))
print(result)
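Dimension names can be attached to an array through the dimnames argument of array( ). The sketch below is an assumption about what a complete call might look like; the labels ROW1, COL1, Matrix1 and so on are hypothetical.

```r
# Hypothetical labels for each dimension
row.names <- c("ROW1", "ROW2", "ROW3", "ROW4")
column.names <- c("COL1", "COL2", "COL3")
matrix.names <- c("Matrix1", "Matrix2")

result <- array(1:24, dim = c(4, 3, 2),
                dimnames = list(row.names, column.names, matrix.names))
print(result)

# Elements can then be accessed by name as well as by position
print(result["ROW2", "COL3", "Matrix2"])
```

Named access returns the same element as result[2, 3, 2].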
In the R programming language, the range( ) function is used to get the minimum and maximum
values of the vector passed to it as an argument.
# Creating a vector
x <- c(8, 2, Inf, 5, 4, NA, 9, 54, 18)
# Calling range( ) function
range(x)
# Calling range() function
# excluding NA values
range(x, na.rm = TRUE)
# Calling range( ) function
# excluding NA and non-finite values
range(x, na.rm = TRUE, finite = TRUE)
WEEK-5
a) Implement R script to perform various operations on matrices
Create a matrix
R provides the matrix( ) function to create a matrix.
This function plays an important role in data analysis.
The syntax of the matrix( ) function in R is:
matrix(data, nrow, ncol, byrow, dimnames)
Access a matrix
1. We can access the element which is present on the nth row and mth column.
2. We can access all the elements of the matrix which are present on the nth row.
3. We can also access all the elements of the matrix which are present on the mth column.
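The three kinds of access listed above can be sketched as follows. The matrix P and its row and column labels are hypothetical and chosen only for illustration.

```r
# Hypothetical row and column labels
row_names <- c("row1", "row2", "row3")
col_names <- c("col1", "col2", "col3", "col4")

# Fill a 3 x 4 matrix row by row with the values 1 to 12
P <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE,
            dimnames = list(row_names, col_names))
print(P)

print(P[2, 3])   # element on the 2nd row and 3rd column
print(P[2, ])    # all elements on the 2nd row
print(P[, 3])    # all elements on the 3rd column
```

Leaving an index empty, as in P[2, ], selects everything along that dimension.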
Dataframe
A data frame is a two-dimensional array-like structure or a table in which each column contains
values of one variable, and each row contains one set of values from each column. A data frame is a
special case of a list in which each component has equal length.
A data frame is used to store a data table; its columns are vectors held together in the form
of a list.
In R, data frames are created with the help of the data.frame( ) function. This function
accepts vectors of any type, such as numeric, character, or integer.
Creating the data frame.
emp_data = data.frame(
  employee_id = c(1:5),
  employee_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
  sal = c(623.3,915.2,611.0,729.0,843.25),
  starting_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                            "2015-03-27")))
print(emp_data)
R provides a built-in function called str( ) which compactly displays the structure of the data frame.
str(emp_data)
1. We can extract the specific columns from a data frame using the column name.
2. We can extract the specific rows also from a data frame.
3. We can extract the specific rows corresponding to specific columns.
Extracting 2nd and 3rd row corresponding to the 1st and 4th column
final <- emp_data[c(2,3),c(1,4)]
print(final)
Week 6
a) Write an R script to find basic descriptive statistics using Summary, str, and Quartile
function on mtcars and cars datasets.
Let’s start simple with the summarizing functions str ( ) and summary ( ).
The str( ) function takes a single object as an argument and compactly shows us the structure of
the input object. It shows us details like length, data type, names and other specifics about the
components of the object.
Since the mtcars dataset is a built-in dataset in R, we can load it by using the following
command:
data(mtcars)
We can take a look at the first six rows of the dataset by using the head ( ) function:
head(mtcars)
We can use the summary ( ) function to quickly summarize each variable in the dataset:
summary(mtcars )
We can use the dim ( ) function to get the dimensions of the dataset in terms of number of rows
and number of columns:
dim( mtcars)
We can also use the names ( ) function to display the column names of the data frame:
names( mtcars)
We can use hist ( ) function to create a histogram of the values for a certain variable:
We can also use plot( ) function to create a scatter plot of any pair wise combination of variables:
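A minimal sketch of both plotting calls follows. The choice of the mpg and wt columns, the titles, and the axis labels are assumptions made for illustration.

```r
# Histogram of miles per gallon; hist() also returns the bin counts invisibly
h <- hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")

# Scatter plot of weight against miles per gallon
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
```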
Basic descriptive statistics for a single variable can be computed with the following functions:
mean(mtcars$mpg)
median(mtcars$mpg)
sd(mtcars$mpg)
var(mtcars$mpg)
mad(mtcars$mpg)
sum(mtcars$mpg)
length(mtcars$mpg)
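The task statement also mentions quartiles. A minimal sketch using the base R functions quantile( ) and IQR( ) follows; the choice of the mpg column and of the 90th percentile is an assumption made for illustration.

```r
# Quartiles (0%, 25%, 50%, 75%, 100%) of miles per gallon
q <- quantile(mtcars$mpg)
print(q)

# A specific percentile, here the 90th
print(quantile(mtcars$mpg, probs = 0.9))

# Interquartile range: distance between the 3rd and 1st quartiles
print(IQR(mtcars$mpg))
```

The 50% value returned by quantile( ) is the same as median(mtcars$mpg).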
Cumulative measures in R
Cumulative measures are statistical measures that are calculated sequentially.
These measures evolve with the data.
They provide insight into the progression and growth of the data.
R provides a few functions that calculate cumulative measures with ease. These functions are:
Cumulative sum: The cumsum( ) function calculates the cumulative sum of a given vector.
Cumulative max: The cummax( ) function finds the cumulative maximum values of an input vector.
Cumulative min: The cummin( ) function finds the cumulative minimum values of a vector.
Cumulative product: The cumprod( ) function finds the cumulative product of a vector.
a <- c(1:9,4,2,4,5:2)
cumsum(a)
cummax(a)
cummin(a)
cumprod(a)
Row and Column Summary Functions in R
There are certain functions in R that give summary statistics for the rows or
columns of data frames, matrices, or any other data structure with two or more dimensions.
These functions are:
rowMeans: The rowMeans( ) function returns the mean of each row of a data structure.
rowSums: The rowSums( ) function finds the sum of each row of a data structure.
colMeans: The colMeans( ) function returns the mean of each column of a data structure.
colSums: The colSums( ) function calculates the sum of each column of a data structure.
Selecting rows or columns first, as in mtcars[2,], restricts the summary to that selection.
rowMeans(mtcars[2,])
rowSums(mtcars[2,])
colMeans(mtcars)
colSums(mtcars)
Sorting and Ordering the Data
The sort( ) and the order( ) functions are included in the base package of R and are used to sort
or order the data in the desired order.
1. The sort function
The sort( ) function sorts the elements of a vector or a factor in increasing or decreasing order.
The syntax of the sort function is:
sort(x, decreasing = FALSE, na.last = NA)
x is the input vector or factor that has to be sorted.
decreasing is a boolean that controls whether the input vector or factor is to be sorted in
decreasing order (when set to TRUE) or in increasing order (when set to FALSE).
na.last is an argument that controls the treatment of the NA values present inside the input
vector/factor. If na.last is set as TRUE, then the NA values are put at the last. If it is set
as FALSE, then the NA values are put first. Finally, if it is set as NA, then the NA values
are removed.
sort(c(3,16,34,77,29,95,24,47,92,64,43), decreasing = FALSE)
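The order( ) function returns the positions of the elements in sorted order instead of the sorted values themselves, which makes it suitable for reordering the rows of a data frame. A minimal sketch follows; the vector x and the choice of the mpg column are assumptions made for illustration.

```r
# Illustrative vector
x <- c(30, 10, 20)

# order() gives the positions of the elements in sorted order
idx <- order(x)
print(idx)
print(x[idx])   # same result as sort(x)

# Reorder the rows of mtcars by mpg, highest first
by_mpg <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
print(head(by_mpg))
```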
There are multiple ways to make subsets of a dataset in R. Depending on the shape and size of
the subset, you can use different operators to index certain parts of a dataset and assign
those parts to a variable. These operators are:
1. The $ operator
The $ sign can be used to access a single variable (column) of a dataset. The result of using this
notation is a single vector.
2. The [[ operator
The [[ operator selects a single element like the $ notation. Unlike the $ operator, the [[ operator
can be used by specifying the target position instead of the name of the target element.
3. The [ operator
The [ operator takes a numeric, character, or logical vector to identify its target. This operator
can return multiple elements depending on the given target indices. Applied to a data frame
with a single index, it returns a data frame rather than a vector.
mtcars$hp
mtcars[[4]]
mtcars[4]
The sample function
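The sample( ) function returns a random sample of the given data. The size argument
specifies how many items to draw, and the replace argument whether an item may be drawn more
than once. Applied to a data frame, sample( ) draws columns, so the call below returns three
randomly chosen columns of mtcars.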
sample(mtcars, 3)
Merging Datasets
There are multiple ways of merging or combining datasets in R. We will be taking a look at
the cbind( ), rbind( ), and merge( ) functions, which allow us to do so.
1. The cbind function
The cbind( ) function combines two datasets (matrices or data frames) along their columns.
m1 <- matrix(1:9, nrow = 3, ncol = 3)
m2 <- matrix(10:18, nrow = 3, ncol = 3)
cbind(m1,m2)
2. The rbind function
The rbind() function combines two data frames along their rows. If the two data frames have
identical variables, then rbind is the easiest way to combine them into one data frame with a
larger number of rows.
rbind(m1,m2)
3. The merge function
The merge( ) function performs what is called a join operation in databases. This function
combines two data frames based on common columns.
names <- c('v1','v2','v3')
colnames(m1) <- names
colnames(m2) <- names
merge(m1,m2, by = names, all = TRUE)
iris[iris$Species == "setosa", ]
WEEK-7
It’s also possible to choose a file interactively using the function file.choose( ), which I
recommend if you’re a beginner in R programming:
mydata <- read_excel (file.choose())
Specify sheet with a number or name
read.xlsx(file, sheetIndex, header=TRUE)
c) Another way to import an Excel file after loading the readxl library is …
d) In R Studio, go to File>Import Dataset>from Excel and browse…
a) Implement R Script to create a Pie chart, Bar Chart, scatter plot and Histogram
Pie chart
A pie chart is a circular statistical graphic which is divided into slices to illustrate numerical
proportions. Each slice (sector) shows the relative size of one data value. The circle is cut
along radii into segments describing relative frequencies or magnitudes. A pie chart is also
known as a circle graph.
R Programming Language uses the function pie( ) to create pie charts.
It takes positive numbers as a vector input.
Syntax: pie(x, labels, radius, main, col, clockwise)
x: This parameter is a vector that contains the numeric values which are used in the pie chart.
labels: This parameter gives the description to the slices in pie chart.
radius: This parameter is used to indicate the radius of the circle of the pie chart.(value
between -1 and +1).
main: This parameter represents the title of the pie chart.
clockwise: This parameter contains the logical value which indicates whether the slices are
drawn clockwise or anticlockwise.
col: This parameter gives colors to the slices of the pie chart.
Examples
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries", col =rainbow(length(lbls)))
You can change the start angle of the pie chart with the init.angle parameter.
The value of init.angle is given in degrees, where the default angle is 0.
x <- c(10,20,30,40)
pie(x, init.angle = 90)
To add a legend explaining each slice, use the legend( ) function:
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
colors <- c("blue", "yellow", "green", "black")
pie(x, labels = mylabel, main = "Pie Chart", col = colors)
legend("bottomright", mylabel, fill = colors)
Bar Plots
A bar chart represents data in rectangular bars with length of the bar proportional to the value of
the variable. R uses the function barplot( ) to create bar charts. R can draw both vertical and
horizontal bars in the bar chart. In bar chart each of the bars can be given different colors.
Syntax: barplot(H, xlab, ylab, main, names.arg, col)
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart")
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")
To create a horizontal bar chart, set horiz = TRUE:
barplot(H, horiz = TRUE)
Histograms
A histogram displays the distribution of a numerical variable using rectangular bars whose areas
are proportional to the frequency of the variable and whose widths are successive numerical
intervals. It is a graphical representation that groups data points into specified ranges. It has
no gaps between the bars and is otherwise similar to a vertical bar graph.
R creates histograms using the hist( ) function.
Syntax: hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
Example:
v <- c(19, 23, 11, 5, 16, 21, 32,14, 19, 27, 39)
hist(v,xlab ="No.of Articles ",col = "green", border = "black")
hist(v,xlab="No.ofArticles",ylab="frequency",col = "red",border="white")
Scatter Plot
A "scatter plot" is a type of plot used to display the relationship between two numerical variables,
and plots one dot for each observation.
It needs two vectors of same length, one for the x-axis (horizontal) and one for the y-axis (vertical):
Example1
x<-c(5,7,8,7,2,2,9,4,11,12,9,6)
y<- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
plot(x,y,main="Observation of Cars", xlab="Car age", ylab="Car speed")
Example2
input <- mtcars[, c('wt', 'mpg')]
print(head(input))
plot(x = input$wt,y = input$mpg,xlab = "Weight",ylab = "Mileage",
xlim = c(1.5, 4), ylim = c(10, 25),main = "Weight vs Mileage")
b) Mean in R Programming Language
The mean is the sum of the observations divided by the total number of observations.
It is also called the average.
x <- read.csv("C:/Users/CSE/Downloads/CardioGoodFitness.csv")
View(x)
head(x)
mean_age = mean(x$Age)
mean_age
med = median(x$Age)
med
Example
x <- c(1, 2, NA, 4, 5, NA, 7, 8, NA, 9, 10)
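When missing values are present, mean( ) and median( ) return NA unless na.rm = TRUE is passed. A short sketch using the vector above, repeated here so that the block runs on its own:

```r
x <- c(1, 2, NA, 4, 5, NA, 7, 8, NA, 9, 10)

# NA propagates by default
print(mean(x))

# Drop missing values before computing
print(mean(x, na.rm = TRUE))
print(median(x, na.rm = TRUE))
```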
Variance
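Variance is the average of the squared differences between each value and the mean. The var( )
function computes the sample variance, which divides the sum of squared differences by n - 1.
One can calculate the variance by using the var( ) function in R.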
Syntax: var(x)
list = c(2, 4, 4, 4, 5, 5, 7, 9)
print(var(list))
Standard Deviation
Standard Deviation is the square root of variance. It is a measure of the extent to which data
varies from the mean.
Syntax: sd(x)
list = c(2, 4, 4, 4, 5, 5, 7, 9)
print(sd(list))
list1= c(290, 124, 127, 899)
print(sd(list1))
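To make the definitions concrete, the sketch below recomputes the sample variance and standard deviation by hand and compares them with var( ) and sd( ). The names values, manual_var, and manual_sd are chosen for illustration.

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(values)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((values - mean(values))^2) / (n - 1)
print(manual_var)
print(var(values))

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)
print(manual_sd)
print(sd(values))
```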
WEEK -9
Normal Distribution
The normal distribution is a probability distribution used in statistics that describes how data
values are distributed. It is the most important probability distribution used in
statistics.
Data collected at random from independent sources is often approximately normally
distributed.
The graph produced after plotting the value of the variable on the x-axis and the count of the
values on the y-axis is a bell-shaped curve. The peak of the curve is at the mean of the
data set, and half of the values of the data set lie on the left side of the mean and the other
half lie on the right side. The graph is symmetric about the mean.
In R, there are 4 built-in functions for working with the normal distribution:
dnorm(x, mean, sd)
dnorm( ) function in R programming measures density function of distribution
pnorm(x, mean, sd)
pnorm( ) function is the cumulative distribution function which measures the probability that a
random number X takes a value less than or equal to x
qnorm(p, mean, sd)
qnorm( ) function is the inverse of pnorm( ) function.
It takes the probability value and gives output which corresponds to the probability value.
rnorm(n, mean, sd)
rnorm( ) function in R programming is used to generate a vector of random numbers which are
normally distributed.
x is a vector of values
mean is the mean of the distribution. Its default value is 0.
sd is the standard deviation of the distribution. Its default value is 1.
n is the number of observations.
p is a vector of probabilities
Examples
x = seq(-15, 15, by=0.1)
x
y = dnorm(x, mean(x), sd(x))
plot(x, y)
y = pnorm(x, mean(x), sd(x))
plot(x, y)
# qnorm( ) expects probabilities between 0 and 1, not arbitrary x values
p = seq(0, 1, by = 0.01)
y = qnorm(p, mean(x), sd(x))
plot(p, y)
y = rnorm(x, mean(x), sd(x))
plot(x, y)
y=rnorm(50)
hist(y,main="Normal Distribution")
Binomial Distribution
Example:
dbinom(3, size = 13, prob = 1/6)
probabilities <- dbinom(x = c(0:10), size = 10, prob = 1 / 6)
plot(0:10, probabilities, type = "l")
data.frame(probabilities)
2. pbinom(k, n, p)
The function pbinom( ) is used to find the cumulative probability of data following a binomial
distribution up to a given value, i.e. it finds P(X <= k),
where n is the total number of trials, p is the probability of success, and k is the value at
which the probability has to be found.
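The relationship between pbinom( ) and dbinom( ) can be checked with a short sketch. It reuses the size and probability from the dbinom( ) example above; the cut-off k = 3 is an assumption chosen for illustration.

```r
# Probability of at most 3 successes in 13 trials with success probability 1/6
cumulative <- pbinom(3, size = 13, prob = 1/6)
print(cumulative)

# The same value obtained by adding the individual probabilities for 0 to 3
summed <- sum(dbinom(0:3, size = 13, prob = 1/6))
print(summed)
```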
Linear Regression
Linear regression is a commonly used type of predictive analysis. Regression analysis is a
very widely used statistical tool to establish a relationship model between two variables.
One of these variables is called the predictor variable, whose value is gathered through
experiments.
The other variable is called the response variable, whose value is derived from the predictor
variable.
There are two types of linear regression.
Simple Linear Regression
Multiple Linear Regression
A simple linear regression aims to model the relationship between the magnitude of a single
independent variable X and a dependent variable Y by trying to estimate exactly how
much Y will change when X changes by a certain amount.
The independent variable X, also called the predictor, is the variable used to make the
prediction.
The dependent variable Y, also known as the response, is the one we are trying to predict.
The general mathematical equation for a linear regression is
y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known.
Create a relationship model using the lm( ) function in R.
Get a summary of the relationship model to see the errors in prediction, also
called residuals.
To predict the weight of new persons, use the predict( ) function in R.
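Example
# x holds the heights (predictor), y holds the weights (response)
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Fit the model y = ax + b
relation <- lm(y~x)
print(relation)
# Coefficients, residuals and goodness of fit
print(summary(relation))
# Predict the weight of a person with height 170
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)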
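To connect the fitted model with the equation y = ax + b, the coefficients can be pulled out with coef( ) and the prediction reproduced by hand. This is a sketch using the same height and weight data; the names b, a, by_hand and from_predict are only illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# coef( ) returns the intercept (b) and the slope for x (a)
b <- coef(relation)[["(Intercept)"]]
a <- coef(relation)[["x"]]

# Prediction computed directly from y = ax + b
by_hand <- a * 170 + b
by_hand

# Prediction from predict( ) for the same height
from_predict <- predict(relation, data.frame(x = 170))
from_predict
```

Both values should agree, since predict( ) evaluates the same linear equation.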
Multiple regression extends linear regression to the relationship between more than
two variables. In simple linear regression we have one predictor and one response variable, but in
multiple regression we have more than one predictor variable and one response variable.
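A minimal sketch of multiple regression, assuming the built-in mtcars data set and an illustrative choice of mpg as the response with wt and hp as predictors (these choices, and the new_car values, are not from the manual):

```r
# Fit mpg as a function of two predictors: weight (wt) and horsepower (hp)
model <- lm(mpg ~ wt + hp, data = mtcars)

# One intercept plus one coefficient per predictor
print(coef(model))
print(summary(model))

# Predict mpg for a hypothetical car
new_car <- data.frame(wt = 3, hp = 110)
predicted <- predict(model, new_car)
print(predicted)
```

The only change from simple regression is the formula: extra predictors are added on the right-hand side with +.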
Time series analysis in R is used to see how an object behaves over a period of time. It can be
easily done with the ts( ) function and a few parameters. A time series takes a data vector and
connects each data point with a timestamp value given by the user. It is used to learn and
forecast the behavior of a business asset over a period of time.
Syntax: objectName <- ts(data, start, end, frequency)
where,
data – represents the data vector
start – represents the first observation in time series
end – represents the last observation in time series
frequency – represents the number of observations per unit time. For example, when the unit of
time is a year, frequency=12 for monthly data and frequency=1 for yearly data.
Example:
Weekly data of COVID-19 positive cases from 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,87820, 95314, 126214, 218843, 471497, 936851,
1508725, 2072113)
install.packages("lubridate")
library(lubridate)
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
plot(mts, xlab ="Weekly Data",ylab ="Total Positive Cases",
main ="COVID-19 Pandemic",col.main ="darkgreen")
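The effect of the frequency parameter can be seen in a small sketch with monthly data. The sales figures below are made up purely for illustration.

```r
# Two years of made-up monthly values (24 observations)
sales <- c(12, 15, 14, 18, 21, 25, 27, 26, 22, 19, 16, 13,
           14, 17, 16, 20, 23, 28, 30, 29, 24, 21, 18, 15)

# frequency = 12 means 12 observations per year; start = c(2020, 1) is January 2020
monthly <- ts(sales, start = c(2020, 1), frequency = 12)
print(monthly)

frequency(monthly)  # observations per unit time
start(monthly)      # first observation (year, month)
end(monthly)        # last observation (year, month)

# Keep only the second year
second_year <- window(monthly, start = c(2021, 1))
print(second_year)
```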
Data transformation is the process of converting, cleaning, and organizing data from one format into
another. It is one of the key aspects of work in data analysis, data science, and even artificial
intelligence.
Factors are data structures used to represent categorical data and store it as levels.
They are stored as integers with a corresponding label for
every unique integer. Though factors may look similar to character vectors, they are stored internally as integers.
Convert the data vector into a factor.
The factor( ) command is used to create and modify factors in R.
Example
v =c(1,2,3,3,4, NA,3,2,4,5, NA,5)
print("Original vector:")
print(v)
print(factor(v))
print("Levels of factor of the said vector:")
print(levels(factor(v)))
Example
V = c("North", "South", "East", "East", "West", "South", "North")
drn <- factor(V)
drn
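The integer storage behind a factor can be inspected directly. The sketch below reuses the direction vector from the example; the names codes and counts are only illustrative.

```r
V <- c("North", "South", "East", "East", "West", "South", "North")
drn <- factor(V)

# The labels, sorted alphabetically by default
levels(drn)

# The integer code stored for each element (position of its label in levels)
codes <- as.integer(drn)
codes

# Number of observations in each category
counts <- table(drn)
counts
```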
Date Operations
Dates in R
1. Get the system date
Sys.Date( )
Sys.time( )
date <- Sys.Date( )
format(date,format="%a")
Specifier Description
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
The as.Date( ) function handles dates in R without time. This function takes the date as a string
in the format YYYY-MM-DD or YYYY/MM/DD and internally represents it as the number of
days since 1 January 1970.
x <- as.Date("2024-01-01")
x
y <- as.Date("2024-01-10")
y
range=seq(x,y,"days")
range
install.packages("lubridate")
library(lubridate)
x <- ymd("2024-01-01")
y <- ymd("2024-01-10")
range=seq(x,y,"days")
range
x <-dmy("01-04-2024")
y <-dmy("10-04-2024")
range=seq(x,y,"days")
range
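Because a Date is stored as a day count, it can be converted to a number and used in arithmetic. A short sketch using the same two dates as above; the weekday and month names printed by format( ) depend on the system locale.

```r
x <- as.Date("2024-01-01")
y <- as.Date("2024-01-10")

# Day count since 1 January 1970
one_day <- as.numeric(as.Date("1970-01-02"))
one_day
as.numeric(x)

# Subtracting two dates gives the gap in days
gap <- as.numeric(y - x)
gap

# Formatting with the specifiers from the table
format(x, "%C")            # century
format(x, "%a %d %b %Y")   # abbreviated weekday and month
format(x, "%A, %d %B %Y")  # full weekday and month
```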
WEEK-11
Missing Data
In R, the NA symbol represents missing values, and the NaN symbol, which stands for “not a
number”, represents the result of impossible arithmetic operations (like dividing zero by zero). In
simple words, both NA and NaN are treated as missing values in R.
Finding Missing Data in R
R provides built-in functions for finding missing values.
Using the is.na( ) function: this function returns a vector that contains only logical values (either
TRUE or FALSE).
Example:
1. x <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
x
is.na(x)
which(is.na(x))
sum(is.na(x))
2. y<- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)
y
is.nan(y)
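The difference between is.na( ) and is.nan( ) is easiest to see side by side. This sketch reuses the vector y from the example; n_missing, n_nan and clean_mean are illustrative names.

```r
y <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)

# is.na( ) is TRUE for both NA and NaN
is.na(y)
n_missing <- sum(is.na(y))

# is.nan( ) is TRUE only for NaN
is.nan(y)
n_nan <- sum(is.nan(y))

# A summary over the whole vector is itself missing...
mean(y)

# ...unless missing values are dropped first with na.rm = TRUE
clean_mean <- mean(y, na.rm = TRUE)
clean_mean
```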
Remove Values Using Filter Functions
na.omit( ) − Drops any rows that contain a missing value.
na.exclude( ) − Also drops rows having at least one missing value, but keeps track of their positions so that residuals and predictions can be padded with NA.
na.pass( ) − Takes no action.
na.fail( ) − Terminates the execution if any missing values are found.
Example:
1. na.exclude(x)
na.exclude(y)
na.omit(x)
2. data <- data.frame(A = c(1, 2, NA, 4, 5),B = c(NA, 2, 3, NA, 5),
C = c(1, 2, 3, NA, NA))
data
is.na(data)
sum(is.na(data))
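The filter functions can be tried on the same data frame. This is a sketch only; tryCatch( ) is used here so that the error raised by na.fail( ) does not stop the script.

```r
data <- data.frame(A = c(1, 2, NA, 4, 5),
                   B = c(NA, 2, 3, NA, 5),
                   C = c(1, 2, 3, NA, NA))

# Missing values per column
colSums(is.na(data))

# na.omit( ) keeps only rows with no missing value
complete <- na.omit(data)
complete

# na.pass( ) returns the data unchanged
untouched <- na.pass(data)

# na.fail( ) raises an error because the data contain missing values
failed <- tryCatch({
  na.fail(data)
  FALSE
}, error = function(e) TRUE)
failed
```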
Identify and Remove Duplicate Data in R
We can use the duplicated( ) function to find which values in a vector are duplicates (and, with sum( ), how many there are)
and unique( ) to remove duplicate values.
Example:
a <- c(1, 2, 3, 4, 4, 5)
duplicated(a)
sum(duplicated(a))
unique(a)
Example:
s=data.frame(name=c("Ram","Geeta","John","Paul", "Cassie","Geeta","Paul"),
maths=c(7,8,8,9,10,8,9),
science=c(5,7,6,8,9,7,8),
history=c(7,7,7,7,7,7,7))
s
duplicated(s)
sum(duplicated(s))
unique(s)
duplicated(s$maths)
unique(s$maths)
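The logical vector returned by duplicated( ) can also be used to index the data frame, which shows the repeated rows themselves. A sketch on the same data; repeats and deduped are illustrative names.

```r
s <- data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                maths = c(7, 8, 8, 9, 10, 8, 9),
                science = c(5, 7, 6, 8, 9, 7, 8),
                history = c(7, 7, 7, 7, 7, 7, 7))

# Rows that repeat an earlier row
repeats <- s[duplicated(s), ]
repeats

# Rows left after dropping the repeats
deduped <- s[!duplicated(s), ]
deduped

# This matches what unique( ) returns for the data frame
identical(deduped, unique(s))
```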
Spelling Check
install.packages("dplyr")
install.packages("stringr")
install.packages("quanteda")
install.packages("hunspell")
install.packages("flextable")
library(dplyr)
library(stringr)
library(quanteda)
library(hunspell)
library(flextable)
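A minimal sketch of a spelling check with the hunspell package, assuming its default English dictionary. The sample words and sentence are made up for illustration, and the suggestions returned depend on the dictionary installed.

```r
library(hunspell)

# Check individual words: TRUE means the word is in the dictionary
words <- c("language", "langauge", "statistics", "statistcs")
correct <- hunspell_check(words)
correct

# Suggested corrections for the misspelled words
suggestions <- hunspell_suggest(words[!correct])
suggestions

# Find misspelled words in running text
text <- "R is a langauge for statistcs"
bad <- hunspell(text)
bad
```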
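Example:
install.packages("RSQLite")
library(RSQLite)
# Open (or create) an SQLite database file
con <- dbConnect(SQLite(), 'play-example.db')
con
# Copy the built-in mtcars data set into a table named cars
dbWriteTable(con, 'cars', mtcars)
dbListTables(con)
# Run SQL queries against the table
dbGetQuery(con, 'SELECT * FROM cars ')
dbGetQuery(con, 'SELECT * FROM cars LIMIT 5')
dbGetQuery(con, 'SELECT mpg, cyl FROM cars WHERE mpg>30 ORDER BY mpg')
# Close the connection when finished
dbDisconnect(con)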
Loading SPSS (Statistical Package for the Social Sciences) and
SAS (Statistical Analysis System) files
The easiest way to import SPSS files into R is to use the read_sav() function from
the haven library. SAS files can be imported with read_sas() from the same library.
install.packages('haven')
library(haven)
data <- read_sav('C:/Users/User_Name/file_name.sav')
data <- read_sas('C:/Users/User_Name/file_name.sas7bdat')