Data Analysis Using R Programming
R Programming
BY
Dr. Anisha P Rodrigues
Associate Professor
Department of Computer Science and Engineering
NMAMIT,Nitte
Dr. Anisha P Rodrigues, Dept of
9/21/2024 CSE,NMAMIT, Nitte
Topics
◼ Features of R
◼ Environment setup with RStudio
◼ R Commands
◼ R Script file
◼ Variables and Data Types
◼ Operators
◼ Decision making, Loops, Strings, Vectors, Lists,
Matrices, Arrays, Factors, Data Frames
◼ Functions
◼ R packages
◼ Data Re-shaping
◼ https://www.mycompiler.io/online-r-compiler
◼ https://www.w3schools.com/r/r_compiler.asp
Windows Installation
◼ Go to the CRAN (Comprehensive R Archive
Network) website.
❑ https://cran.r-project.org/
◼ You can download the Windows installer
version of R and save it in a local directory.
◼ R-4.3.1 for Windows
❑ print ( myString)
◼ $ Rscript test.R
❑ # Create a vector.
❑ apple <- c('red','green',"yellow")
❑ print(apple)
❑ # Get the class of the vector.
❑ print(class(apple))
◼ Result:
❑ [1] "red" "green" "yellow"
❑ [1] "character"
❑ print(list1)
❑ print(list1[1])
❑ print(M)
nrow * ncol
◼ Example:
❑ # Create an array.
❑ a <- array(c('green','yellow'),dim=c(3,3,2))
❑ print(a)
◼ print(is.factor(data))
◼ print(factor_data)
◼ print(is.factor(factor_data))
Result:
❑ [1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East"
"North"
❑ [1] FALSE
❑ [1] East West East North North East West West West East North
❑ Levels: East North West
❑ [1] TRUE
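The statements that build the factor are not in the text above. A minimal sketch that would be consistent with the printed result, assuming the input is the character vector of region names shown there:

```r
# Assumed input: the region names that appear in the printed result.
data <- c("East", "West", "East", "North", "North", "East",
          "West", "West", "West", "East", "North")

# A plain character vector is not a factor.
print(is.factor(data))

# factor() stores the distinct values as levels (sorted alphabetically here).
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
```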
❑ print(ls())
◼ scan()
◼ readline()
◼ Syntax:
seq(from, to, by, length.out)
◼ Parameters:
❑ from: Starting element of the sequence
❑ to: Ending element of the sequence
❑ by: Difference between the elements
❑ length.out: Desired length of the sequence
◼ Output:
[1] 1 3 5 7 9
[1] 1.0 2.5 4.0 5.5 7.0 8.5 10.0
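The calls that produced this output are not in the text. As an illustration only, the following calls would give the same two sequences; the variable names are hypothetical:

```r
# Step through 1..9 in increments of 2.
odd_numbers <- seq(from = 1, to = 9, by = 2)
print(odd_numbers)

# Ask for exactly 7 evenly spaced values between 1 and 10.
evenly_spaced <- seq(from = 1, to = 10, length.out = 7)
print(evenly_spaced)
```

With by, the step is fixed and the length follows from it; with length.out, the length is fixed and R computes the step.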
◼ Parameters:
❑ x: Vector to be sorted
❑ decreasing: Boolean value to sort in descending order
◼ Output:
◼ [1] -8.0 -5.0 -4.0 1.2 3.0 4.0 6.0 7.0 9.0
◼ Parameter:
❑ x: Data object
# Create a vector
vec <- 1:5
vec
Vector Creation
◼ Even when you write just one value in R, it becomes a
vector of length 1 and belongs to one of the above vector
types.
◼ [1] 12.5
◼ [1] 63
◼ [1] TRUE
◼ [1] 2+3i
◼ [1] 68 65 6c 6c 6f
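The single-value results above can be reproduced with statements like the following. This is a sketch; the exact statements on the slide are not in the text, and the variable names are hypothetical:

```r
# Each single value is an atomic vector of length 1.
v_double  <- 12.5                # double
v_integer <- 63L                 # integer (note the L suffix)
v_logical <- TRUE                # logical
v_complex <- 2+3i                # complex
v_raw     <- charToRaw("hello")  # raw bytes of the string

print(v_double)
print(v_integer)
print(v_logical)
print(v_complex)
print(v_raw)
```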
❑ Giving a negative value in the index drops that element from the result.
◼ [1] "Sun"
◼ [1] "Mon" "Tue" "Wed" "Thurs" "Fri" "Sat"
◼ [1] "Mon" "Tue" "Fri"
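A sketch of position-based and negative indexing that yields the three results above. The vector name days and the chosen positions are assumptions, since the original statements are not in the text:

```r
days <- c("Sun", "Mon", "Tue", "Wed", "Thurs", "Fri", "Sat")

# Positive index: pick the element at position 1.
first_day <- days[1]
print(first_day)

# Negative index: drop the element at position 1, keep the rest.
without_first <- days[-1]
print(without_first)

# A vector of positions picks several elements at once.
picked <- days[c(2, 3, 6)]
print(picked)
```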
◼ # Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
◼ [1] -1 -3 4 -3 -1 9
◼ [1] 12 88 0 40 0 22
function.
v <- c(3,8,4,5,0,11, -9, 304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
◼ When we execute the above code, it
produces the following result:
❑ [1] -9 0 3 4 5 8 11 304
❑ [1] 304 11 8 5 4 3 0 -9
❑ [1] "Blue" "Red" "violet" "yellow"
❑ [1] "yellow" "violet" "Red" "Blue"
◼ Creating a List
◼ The following example creates a list containing strings, numbers,
vectors, and a logical value
◼ $A_Matrix
◼ [1,] 3 5 -2
◼ [2,] 9 1 8
◼ $A_Inner_list
◼ $A_Inner_list[[1]]
◼ [1] "green"
◼ $A_Inner_list[[2]]
◼ [1] 12.3
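The printed elements above could come from a named list such as the one sketched here. The matrix values and inner list are taken from the output; the first element and its name Quarter_1 are hypothetical:

```r
# A list can hold a vector, a matrix, and another list side by side.
list_data <- list(
  c("Jan", "Feb", "Mar"),
  matrix(c(3, 9, 5, 1, -2, 8), nrow = 2),
  list("green", 12.3)
)

# Give each element a name so it can be accessed with $.
names(list_data) <- c("Quarter_1", "A_Matrix", "A_Inner_list")

print(list_data$A_Matrix)
print(list_data$A_Inner_list)
```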
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors
result <- v1+v2
print(result)
◼ [1] 1 2 3 4 5
◼ [[1]]
◼ [1] 10 11 12 13 14
◼ [1] 1 2 3 4 5
◼ [1] 10 11 12 13 14
◼ [1] 11 13 15 17 19
cat("Number of rows:\n")
print(nrow(A))
cat("Number of columns:\n")
print(ncol(A))
cat("Number of elements:\n")
print(length(A))
# 2nd-column deletion
A = A[, -2]
cat("After deleting the 2nd column\n")
print(A)
Finding the sum of rows, columns, and total in a
matrix in R
◼ The row sums, column sums, and grand total of a matrix can be
computed with the functions rowSums, colSums, and sum
respectively.
◼ Example:
M1 <- matrix(1:25, nrow = 5)
print(M1)
Output:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
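Applying the three functions to the same 5 x 5 matrix might look like this; the result variable names are hypothetical:

```r
M1 <- matrix(1:25, nrow = 5)   # filled column by column

row_totals  <- rowSums(M1)     # one total per row
col_totals  <- colSums(M1)     # one total per column
grand_total <- sum(M1)         # total of all elements

print(row_totals)    # 55 60 65 70 75
print(col_totals)    # 15 40 65 90 115
print(grand_total)   # 325
```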
Calculations Across Array Elements
if(boolean_expression) {
# statement(s) will execute if the Boolean expression is true.
}
◼ If the Boolean expression evaluates to true, the block of
code inside the if statement is executed.
◼ If the Boolean expression evaluates to false, execution continues
with the first statement after the end of the if statement (after the
closing curly brace).
❑ [1] "x is an Integer"
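The statement that printed this message is not in the text. One possible version, assuming x holds an integer value created with the L suffix:

```r
x <- 30L   # the L suffix makes this an integer rather than a double

# The block runs only when the condition is TRUE.
if (is.integer(x)) {
  message_text <- "x is an Integer"
  print(message_text)
}
```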
x <- -5
# Use >= so that zero is reported as non-negative.
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
if(boolean_expression 1) {
# Executes when the boolean expression 1 is true.
}else if( boolean_expression 2) {
# Executes when the boolean expression 2 is true.
}else if( boolean_expression 3) {
#Executes when the boolean expression 3 is true.
}else {
# executes when none of the above condition is true.
}
used.
❑ R's switch has no dedicated default keyword.
❑ Instead, an unnamed case placed last is used when no
case matches.
y = "18"
x = switch(
y,
"9"="Hello Arpita",
"12"="Hello Vaishali",
"18"="Hello Nishka",
"21"="Hello Shubham"
)
print (x)
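To illustrate the unnamed fallback case, here is a small sketch. The function name greet and the fallback text are hypothetical:

```r
greet <- function(key) {
  switch(
    key,
    "9"  = "Hello Arpita",
    "18" = "Hello Nishka",
    "No greeting for this key"   # unnamed last case acts as the default
  )
}

print(greet("18"))   # matched case
print(greet("30"))   # no match, so the unnamed case is returned
```

Without the unnamed case, a non-matching key makes switch return NULL invisibly.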
R – Loops
◼ There may be situations where you need to execute a block of code
several times. In general, statements are executed
sequentially: the first statement in a function is executed first,
followed by the second, and so on.
◼ Programming languages provide various control structures that
allow for more complicated execution paths.
◼ A loop statement allows us to execute a statement or group of
statements multiple times. The following is the general form of a
loop statement in most programming languages:
while (test_expression) {
statement
}
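A concrete while loop following this syntax could look like the sketch below. The counter i and the vector visited are hypothetical names:

```r
i <- 2
visited <- c()

# The test expression is checked before every pass through the body.
while (i <= 5) {
  print(i)
  visited <- c(visited, i)
  i <- i + 1   # without this update the loop would never end
}
```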
Syntax
❑ [1] 2
❑ [1] 3
❑ [1] 4
❑ [1] 5
break
◼ Syntax
next
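The difference between break and next can be seen in one short loop. This is an illustrative sketch, not the example from the slides:

```r
squares <- c()

for (n in 1:10) {
  if (n %% 2 == 1) next   # skip odd numbers and continue with the next n
  if (n > 6) break        # leave the loop entirely once n exceeds 6
  squares <- c(squares, n^2)
}

print(squares)   # 4 16 36
```

next jumps to the next iteration, while break ends the loop.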
❑ [1] 4
❑ [1] 9
❑ [1] 16
❑ [1] 25
❑ [1] 36
◼ [1] 58
❑ [1] 45
◼ [1] 6
srt<-function(a){
v<-sort(a, decreasing = TRUE)
print("DESCENDING ORDER")
print(v)
x<-sort(a, decreasing = FALSE)
print("ASCENDING ORDER")
print(x)
}
a <-scan(nlines=6)
srt(a)
❑ Double quotes can be inserted into a string starting and ending with single quotes.
❑ A single quote can be inserted into a string starting and ending with double quotes.
❑ Double quotes cannot be inserted into a string starting and ending with double
quotes, unless they are escaped with a backslash.
❑ A single quote cannot be inserted into a string starting and ending with single
quotes, unless it is escaped with a backslash.
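The quoting rules can be tried out directly. The variable names here are hypothetical:

```r
# The other kind of quote can appear inside a string unescaped.
a <- 'She said "hello"'
b <- "It's a string"

# The same kind of quote must be escaped with a backslash.
c_escaped <- "A \"quoted\" word"
d_escaped <- 'It\'s escaped'

# cat() shows the strings as they are stored, without escape characters.
cat(a, b, c_escaped, d_escaped, sep = "\n")
```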
◼ print(b)
◼ print(c)
◼ print(d)
◼ nchar(x)
◼ toupper(x)
◼ tolower(x)
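A short illustration of these three functions on a sample string, which is an assumption and not taken from the slides:

```r
x <- "Hello World"

n_chars <- nchar(x)     # number of characters, including the space
upper   <- toupper(x)   # all letters converted to upper case
lower   <- tolower(x)   # all letters converted to lower case

print(n_chars)   # 11
print(upper)     # "HELLO WORLD"
print(lower)     # "hello world"
```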
◼ Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
◼ When we execute the above code, it produces the following result:
◼ [1] "act"
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
◼ Add Column
❑ Just add the column vector using a new column name.
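Adding a column might look like the following sketch. The contents of emp.data are made up for illustration, since the original data frame definition is not in the text:

```r
# Hypothetical employee data frame with three rows.
emp.data <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary   = c(623.3, 515.2, 611.0)
)

# Assigning a vector to a new column name adds the column.
# The vector needs one value per row.
emp.data$dept <- c("IT", "Operations", "IT")
print(emp.data)
```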
◼ .libPaths()
◼ library(package Name)
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
◼ install.packages("dplyr") ## install
◼ library("dplyr") ## load
◼ exp(1)
◼ ## [1] 2.718282
◼ 1 %>% exp
◼ ## [1] 2.718282
◼ 1 %>% exp %>% log
◼ ## [1] 1
◼ Simple example
◼ mtcars %>% head(4)
◼ head(mtcars, 4)
# Create DataFrame
df <- data.frame(
id = c(10,11,12,13,14,15,16,17),
name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
gender = c('M','M','F','F','M','M','M','F'),
dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
'1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
df
◼ This takes the first argument as the data frame and the
second argument is the variable name or vector of
variable names.
◼ # select() single column
◼ df %>% select('id')
◼ # select() multiple columns
◼ df %>% select(c('id','name'))
◼ # Select multiple columns by id
◼ df %>% select(c(1,2))
dplyr::mutate()
◼ library(stringr)
# Replace on selected column
df %>%
mutate(name = str_replace(name, "sai", "SaiRam"))
# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl
library(dplyr)
# Read datasets from different files or sources
dataset1 <- read.csv("emp.csv")
print(dataset1)
dataset2 <- read.csv("salary.csv")
print(dataset2)
print("Merged Data:")
print(merged_data)
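The statement that creates merged_data is missing from the text above. One way to combine two data frames on a shared key is base R's merge(), sketched here with small in-memory data frames. The column names ID, Name, and Salary are assumptions, and the slides may well have used a dplyr join instead:

```r
# Hypothetical stand-ins for the contents of emp.csv and salary.csv.
emp    <- data.frame(ID = c(1, 2, 3), Name = c("Asha", "Ravi", "Meena"))
salary <- data.frame(ID = c(1, 2, 4), Salary = c(50000, 42000, 61000))

# By default merge() keeps only the keys present in both data frames.
merged_data <- merge(emp, salary, by = "ID")
print(merged_data)
```

Passing all = TRUE would keep unmatched rows from both sides and fill the gaps with NA.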
group_by(Gender) %>%
summarise(
total_salary = sum(Salary),
average_age = mean(Age),
count = n()
)
print("Aggregated Data:")
print(aggregated_data )
print("Filtered Data:")
print(filtered_data)
mutate(
doubled_salary = Salary * 2,
seniority = ifelse(Age > 28, "Senior", "Junior")
)
print("Transformed Data:")
print(transformed_data)
library(rvest)
Wiki_page <- read_html("https://www.wikipedia.org/")
class(Wiki_page)
Wiki_page
Output:
[1] "xml_document" "xml_node"
print(url_title1)
# Output
# [1] "English" "Español" "Русский" "日本語" "Deutsch" "Français"
# [7] "Italiano" "中文" "" "فارسیPortuguês"
<div>
<a class="button" href="http://scrapfly.io">ScrapFly</a>
</div>
◼ //div/a[contains(@class, "button")]/@href
◼ output:
❑ http://scrapfly.io
Example
<div>
<p class="socials">
Follow us on
<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
</p>
</div>
div/p/a
output:
<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
<div>
<p class="socials">
Follow us on
<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
</p>
</div>
//p[@class='socials']/a
output:
<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
//p[@class='socials']/a[contains(@href, 'twitter.com')]
output:
<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
Lab Program 7
◼ Syntax:
❑ trimws(x)
❑ Parameters:
◼ x: Object or character string
Output:
"Geeks_For_Geeks "
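The output above keeps a trailing space, which would be consistent with trimming only the left side. A sketch under that assumption; the input string is hypothetical:

```r
x <- "   Geeks_For_Geeks "

# which = "left" removes only leading whitespace.
left_trimmed <- trimws(x, which = "left")
print(left_trimmed)

# The default, which = "both", removes whitespace on both sides.
both_trimmed <- trimws(x)
print(both_trimmed)
```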
◼ install.packages("ggplot2")
◼ library(ggplot2)
data(mpg)
colnames(mpg)
[Figure: highway mileage against Displacement, colored by factor(cyl) (levels 4, 5, 6, 8)]
Change colors manually
Custom color palettes can be specified using the
following functions:
scale_color_manual()
◼ These functions allow you to specify your own set of
mappings from levels in the data to aesthetic values.
[Figure: highway mileage against Displacement, colored by factor(cyl) (levels 4, 5, 6, 8)]
scale_fill_manual()
◼ These functions allow you to specify your own set of
mappings from levels in the data to aesthetic values.
◼ Use values to set the colors used for the levels in the
class column
[Figure: count plot with legend drv (4, f, r)]
[Figure: density against displ]
◼ bins = 10
❑ The bins argument specifies the number of bins along each axis
(both x and y). Here, the data will be divided into 10 bins along
the x-axis and 10 bins along the y-axis, forming a 10x10 grid.
◼ aes(fill = ..count..):
❑ This is an aesthetic mapping that specifies how the color of each
bin will be determined. ..count.. refers to the number of data
points in each bin.
◼ The fill aesthetic applies a color based on the count, so the
more data points in a bin, the darker or more intense the color
(depending on the scale used).
❑ The special variables in ggplot with double periods around them
(..count.., ..density.., etc.) are returned by a stat transformation
of the original data set.
theme_classic()
◼ A classic-looking theme, with x and y axis lines and no
gridlines.
theme_dark()
◼ The dark cousin of theme_light(), with similar line sizes
but a dark background. Useful to make thin coloured
lines pop out.
theme_void()
◼ A completely empty theme.
Saving the plot
# 1. Create a plot: displayed on the screen (by default)
ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
# 2. Save the plot to a pdf
ggsave("myplot.pdf")
# 2. OR save it to a png file
ggsave("myplot.png")
or
myplot<-ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
ggsave("myplot.pdf", myplot)
or
myplot<-ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
png("myplot.png")
print(myplot)
dev.off()
◼ We use the ggplotly() function to make a plot
interactive by passing the plot to the
function as an argument. ggplotly()
provides options like zoom-in, zoom-out,
lasso-select, etc.
ggplotly(p)
Lab Program 4
◼ titanic_test
[Figure: Count by Passenger Class, with legend Survived (0, 1)]
[Figure: Fare against Age, with legend Survived (0, 1)]
[Figure: Frequency against Age, split by Survived]
is.na() function:
◼ is.na() is used to deal with missing values in the dataset
or data frame. We can use this method to check the NA
(Not Available) field in a data frame and help to fill them.
◼ It returns a TRUE corresponding to each missing value.
# is.na in r example
x = c(1, 2, NA, 4, NA, 6, 7)
# invoking is.na() to get NA's indexes
print(is.na(x))
◼ Syntax
◼ The basic syntax for calculating mean in R is −
❑ mean(x, na.rm = FALSE, ...)
❑ x is the input vector.
❑ na.rm is used to remove the missing values from the input vector.
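The effect of na.rm can be seen on a vector that contains missing values, here the same vector used in the is.na() example. The result variable names are hypothetical:

```r
x <- c(1, 2, NA, 4, NA, 6, 7)

# With the default na.rm = FALSE, any NA makes the result NA.
mean_with_na <- mean(x)
print(mean_with_na)

# With na.rm = TRUE, missing values are dropped before averaging:
# (1 + 2 + 4 + 6 + 7) / 5 = 4
mean_without_na <- mean(x, na.rm = TRUE)
print(mean_without_na)
```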
[Figure: Age by Survived (0, 1)]
[Figure: Count by Survived, with legend factor(Survived) (0, 1)]
t.test() Function in R
◼ R provides a simple built-in t.test() function
for one-sample, two-sample, and paired
t-tests.
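A minimal sketch of the three kinds of test, using made-up measurements. The vectors before and after and the hypothesized mean of 12 are assumptions for illustration:

```r
# Hypothetical measurements on the same six subjects.
before <- c(12.1, 11.8, 13.0, 12.5, 11.9, 12.7)
after  <- c(12.9, 12.4, 13.6, 13.1, 12.2, 13.5)

# One-sample test: is the mean of 'before' different from 12?
one_sample <- t.test(before, mu = 12)

# Two-sample test: do the two groups have different means?
# (Welch's test, which does not assume equal variances, is the default.)
two_sample <- t.test(before, after)

# Paired test: is the mean of the per-subject differences different from 0?
paired <- t.test(before, after, paired = TRUE)

print(one_sample$p.value)
print(two_sample$p.value)
print(paired$p.value)
```

Each call returns an object of class htest, whose components include statistic, parameter (degrees of freedom), p.value, and conf.int.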
◼ 30.4153
◼ 28.42382
❑ 0: No correlation