Module_1&2_R Programming
SIDDHARTH JAMBHAVDEKAR
WHAT IS R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and
data reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has a large community support
• It has many packages (libraries of functions) that can be used to solve different problems, e.g.
MLR (Machine Learning in R).
WHAT IS R STUDIO?
• To install R, go to cran.r-project.org
• Select the language you would like to use during the installation. Then click OK.
• Click Next.
• Select where you would like R to be installed. It will default to your Program Files on your C Drive. Click Next.
• (Optional) If your computer is 64-bit, you can choose the 64-bit User Installation. Then click Next.
• Then specify whether you want to customize your startup or just use the defaults. Then click Next.
• Then choose the folder in which you want R to be saved, or keep the default R folder that was created. Once you have finished, click Next.
• You can then select additional shortcuts if you would like. Click Next.
• Click Finish.
INSTALLATION OF R STUDIO
• Go to https://posit.co/downloads/
• Click Download RStudio.
• Once the package has downloaded, the Welcome to RStudio Setup Wizard will open. Click
Next and go through the installation steps.
• After the Setup Wizard finishes the installation, RStudio will open.
SIMPLE SCRIPTS IN R
• print("Hello World!")
• Typing "Hello World!" on its own prints the same thing, but here we use the print()
function to perform the operation.
• myString <- "Hello, World!"
print(myString)
• Here first statement defines a string variable myString, where we assign a string
"Hello, World!" and then next statement print() is being used to print the value stored
in variable myString.
COMMENTS IN R
• To write a comment in R, start it with # and then type the comment.
• # This is a comment
"Hello World!"
• "Hello World!" # This is a comment
• # This is a comment
# written in
# more than just one line
"Hello World!"
VARIABLES IN R
• You can also concatenate, or join, two or more elements, by using the paste() function.
• To combine both text and a variable, separate them with a comma (,) inside paste():
• text <- "awesome"
paste("R is", text)
• Output: R is awesome
• You can also use , to add a variable to another variable:
• text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
• Output: R is awesome
• For numbers, the + character works as a mathematical operator:
• num1 <- 5
num2 <- 10
num1 + num2
• Output: 15
MULTIPLE VARIABLES
• R allows you to assign the same value to multiple variables in one line:
• # Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
• # Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
• Variables can store data of different types, and different types can do different things.
• In R, variables do not need to be declared with any particular type, and can even change
type after they have been set:
• my_var <- 30 # my_var is type of numeric
my_var <- "Sally" # my_var is now of type character (aka string)
• Basic data types in R can be divided into the following types:
• # numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
NUMBERS
• A numeric data type is the most common type in R, and contains any number with
or without a decimal, like: 10.5, 55, 787:
• x <- 10.5
y <- 55
• Integers are numeric data without decimals. This is used when you are certain that
you will never create a variable that should contain decimals. To create an integer
variable, you must use the letter L after the integer value:
• x <- 1000L
y <- 55L
• You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
• x <- 1L # integer
y <- 2 # numeric
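As a small illustration of the conversion functions listed above, the sketch below converts the two variables just defined. The names a and b are introduced here only for the example.

```r
x <- 1L  # integer
y <- 2   # numeric

a <- as.numeric(x)  # convert from integer to numeric
b <- as.integer(y)  # convert from numeric to integer

# Print the values and check their classes
a
b
class(a)
class(b)
```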
• 10+5
• 10-5
• These are simple math operations we can do in R.
BUILT-IN MATH FUNCTIONS
• R also has many built-in math functions that allow you to perform mathematical tasks
on numbers.
• For example, the min() and max() functions can be used to find the lowest or highest number in a
set:
• max(5, 10, 15)
min(5, 10, 15)
Output: 15
5
• The sqrt() function returns the square root of a number:
sqrt(16)
Output: 4
• The abs() function returns the absolute (positive) value of a number:
abs(-4.7)
Output: 4.7
• The ceiling() function rounds a number upwards to its nearest integer, and the floor() function rounds a number downwards to its nearest
integer, and returns the result:
ceiling(1.4)
floor(1.4)
Output: 2
1
STRINGS
cat(str)
STRING LENGTH
str <- "Hello World!"
nchar(str)
Output: 12
CHECK A STRING
• Use the grepl() function to check if a character or a sequence of characters is present
in a string:
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Output: TRUE
TRUE
FALSE
• The grep() function returns the positions of the elements that match the pattern:
x <- c('Geeks', 'Geeksfor', 'Geek', 'Geeksfor', 'Gfg')
grep('Geek', x)
ESCAPE CHARACTERS
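• To insert characters that are illegal in a string, you must use an escape character.
• For example, double quotes inside a string that is itself surrounded by double quotes cause an error:
str <- "We are the so-called "Vikings", from the north."
str
• To fix this, escape the inner quotes with a backslash (\"). Printing str shows the backslashes, while cat() prints the text without them:
str <- "We are the so-called \"Vikings\", from the north."
str
cat(str)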
• An "if statement" is written with the if keyword, and it is used to specify a block of code to
be executed if a condition is TRUE:
• a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
ELSE IF
• The else if keyword is R's way of saying "if the previous conditions were not true, then try this condition":
a <- 33
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
IF ELSE
• The else keyword catches anything which isn't caught by the preceding conditions:
a <- 200
b <- 33
if (b > a) {
  print("b is greater than a")
} else if (a == b) {
  print("a and b are equal")
} else {
  print("a is greater than b")
}
NESTED-IF STATEMENTS
• You can also have if statements inside if statements. These are called nested if
statements.
x <- 41
if (x > 10) {
  print("Above 10")
  if (x > 20) {
    print("and also above 20")
  } else {
    print("but not above 20")
  }
} else {
  print("below 10.")
}
WHILE LOOP
• With the while loop we can execute a set of statements as long as a condition is TRUE:
• Print i as long as i is less than 6:
• i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
BREAK
• With the break statement, we can stop the loop even if the while condition is TRUE:
• i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
  if (i == 4) {
    break
  }
}
NEXT
• With the next statement, we can skip an iteration without terminating the loop:
• i <- 0
while (i < 6) {
  i <- i + 1
  if (i == 3) {
    next
  }
  print(i)
}
QUIZ
(IF .. ELSE COMBINED WITH A WHILE LOOP)
• To demonstrate a practical example, let us say we play a game of Yahtzee!
• Print "Yahtzee!" if the dice number is 6:
• dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
FOR LOOP
fruits <- list("apple", "banana", "cherry") # example items
for (x in fruits) {
print(x)
}
QUIZ
(IF .. ELSE COMBINED WITH A FOR LOOP)
• Print "Yahtzee!" if the dice number is 6:
• dice <- 1:6
for(x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
NESTED LOOP
• It is also possible to place a loop inside another loop. This is called a nested loop:
• Print the adjective of each fruit in a list:
• adj <- list("red", "big", "tasty")
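The rest of this example is not shown, so here is a minimal sketch of how the nested loop might look. The fruits list is a hypothetical second list added for illustration; only adj comes from the text above.

```r
adj <- list("red", "big", "tasty")
fruits <- list("apple", "banana", "cherry")  # hypothetical list of fruits

# The outer loop walks through the adjectives,
# the inner loop pairs each adjective with every fruit
for (x in adj) {
  for (y in fruits) {
    print(paste(x, y))
  }
}
```

The inner loop runs to completion for every single step of the outer loop, so this prints 3 x 3 = 9 lines.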
• Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you want, just separate them with
a comma.
• The following example has a function with one argument (fname). When the function is called, we pass along a first name, which is used inside the
function to print the full name:
my_function <- function(fname) {
  paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
• my_function <- function(fname, lname) {
paste(fname, lname)
}
my_function("Peter", "Griffin")
DEFAULT PARAMETER
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
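The calls above rely on a function definition that is not shown. A minimal sketch, assuming the parameter is named country and the function simply builds a sentence (both are assumptions; only the default value "Norway" comes from the comment above):

```r
# country defaults to "Norway" when no argument is passed (assumed parameter name)
my_function <- function(country = "Norway") {
  paste("I am from", country)
}

my_function("Sweden")
my_function()  # falls back to the default value
```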
RETURN VALUE
print(my_function(3))
print(my_function(5))
print(my_function(9))
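The function being called here is not shown. To let a function hand a result back to the caller, use return(). The sketch below assumes, purely for illustration, that the function multiplies its argument by 5:

```r
# Hypothetical definition: returns five times its input
my_function <- function(x) {
  return(5 * x)
}

print(my_function(3))
print(my_function(5))
print(my_function(9))
```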
NESTED FUNCTIONS
Nested_function(Nested_function(2,2), Nested_function(3,3))
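The call above passes the results of two function calls as arguments to a third call of the same function. The definition of Nested_function is not shown; a minimal sketch, assuming it adds its two arguments:

```r
# Hypothetical definition: adds its two arguments
Nested_function <- function(x, y) {
  a <- x + y
  return(a)
}

# The inner calls are evaluated first (2 + 2 and 3 + 3),
# then their results are passed to the outer call
Nested_function(Nested_function(2, 2), Nested_function(3, 3))
```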
FUNCTION WITHIN A FUNCTION
• R also accepts function recursion, which means a defined function can call itself.
• Recursion is a common mathematical and programming concept. Its benefit is that you can
loop through data to reach a result.
• The developer should be very careful with recursion as it can be quite easy to slip into writing a function
which never terminates, or one that uses excess amounts of memory or processor power. However, when
written correctly, recursion can be a very efficient and mathematically-elegant approach to programming.
• In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We use the k
variable as the data, which decrements (-1) every time we recurse. The recursion ends when k is
no longer greater than 0 (i.e. when it is 0).
• To a new developer it can take some time to work out exactly how this works. The best way to find out is by
testing and modifying it.
• tri_recursion <- function(k) {
  if (k > 0) {
    result <- k + tri_recursion(k - 1)
    print(result)
  } else {
    result = 0
    return(result)
  }
}
tri_recursion(6)
GLOBAL VARIABLE
• Variables that are created outside of a function are known as global variables.
• Global variables can be used anywhere, both inside and outside of functions.
• Create a variable outside of a function and use it inside the function:
• txt <- "awesome"
my_function <- function() {
  paste("R is", txt)
}
my_function()
OBJECTS
• Vectors
• Lists
• Matrices
• Arrays
• Data Frames
• Factors
VECTORS
• To combine a list of items into a vector, use the c() function and separate the items with commas.
• To create a vector with numerical values in a sequence, use the : operator:
Z <- 2:7
cat('using colon', Z)
numbers <- 1:10
numbers
• You can also create numerical values with decimals in a sequence, but note that if the last element does not belong to
the sequence, it is not used:
# Vector with numerical decimals in a sequence where the last element is not used
numbers2 <- 1.5:6.3
numbers2
TYPES OF VECTORS
• Numeric Vectors: Numeric vectors are those which contain numeric values such
as integer, float, etc.
• Character Vectors: Character vectors in R contain alphanumeric values and special
characters.
• Logical Vectors: Logical vectors in R contain Boolean values such as TRUE and FALSE, and
NA for missing values.
• To find out how many items a vector has, use the length() function:
fruits <- c("banana", "apple", "orange")
length(fruits)
• To sort the items of a vector, use the sort() function:
sort(fruits, decreasing=TRUE)
• You can access the vector items by referring to its index number inside brackets []. The first item has index 1, the second item has index 2, and so on:
fruits[1]
• You can also access multiple elements by referring to different index positions with the c() function:
fruits[c(1, 3)]
• You can also use negative index numbers to access all items except the ones specified:
• fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Print fruits without the first item
fruits[c(-1)]
REPEAT VECTOR
repeat_each
repeat_times
• Repeat each value independently:
• repeat_indepent <- rep(c(1,2,3), times = c(5,2,1))
repeat_indepent
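The repeat_each and repeat_times vectors printed earlier in this section are not defined in the text. A minimal sketch of how they might be created with rep(); the input vector c(1, 2, 3) and the counts are assumptions chosen to match the example above:

```r
# each = 3 repeats every element three times before moving on
repeat_each <- rep(c(1, 2, 3), each = 3)
repeat_each   # 1 1 1 2 2 2 3 3 3

# times = 3 repeats the whole sequence three times
repeat_times <- rep(c(1, 2, 3), times = 3)
repeat_times  # 1 2 3 1 2 3 1 2 3
```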
numbers
• # Deleting a vector
M <- c(8, 10, 2, 5)
M <- NULL
cat('Output vector', M)
APPENDING IN VECTOR
x <- 1:5
n <- 6:10
y <- c(x, n)
print(y)
x <- 1:5
print(x)
• Appending using indexing
my_vector <- c(1, 2, 3, 4)
my_vector[5] <- 5
my_vector[6] <- 6
my_vector
RANGE FUNCTION IN VECTOR
• Range function is used to get the minimum and maximum values of the vector passed
to it as an argument.
• # R program to find the minimum and maximum element of a vector
x <- c(8, 2, 5, 4, 9, 6, 54, 18)
range(x)
FORMAT FUNCTION IN VECTOR
• format() controls how the content is displayed. The alignment is one of three types: left, right and centre.
print(result1)
print(result2)
print(result3)
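The definitions of result1 to result3 are not shown. A minimal sketch of the three alignments, assuming a one-character string padded to a width of 7 (both values are chosen only for illustration):

```r
# Pad the string "R" to 7 characters with each alignment
result1 <- format("R", width = 7, justify = "left")
result2 <- format("R", width = 7, justify = "right")
result3 <- format("R", width = 7, justify = "centre")

print(result1)  # "R      "
print(result2)  # "      R"
print(result3)  # "   R   "
```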
NUMBER FORMATTING
print(result1)
print(result2)
# Getting the specified minimum number of digits to the right of the decimal point.
print(result3)
print(result4)
• # Getting the number in the string form
• print(result1)
• print(result2)
• result3 <- format(12.3456789, scientific=TRUE) # scientific notation: the result 1.234568e+01 means 1.234568 × 10¹, which is approximately 12.34568
• print(result3)
• print(result4)
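Most of the definitions in this section are not shown. The sketch below illustrates the format() arguments the comments refer to; the input numbers are assumptions picked for the example.

```r
# digits: number of significant digits to keep
result1 <- format(123.456789, digits = 5)
print(result1)  # "123.46"

# nsmall: minimum number of digits to the right of the decimal point
result2 <- format(12.5, nsmall = 2)
print(result2)  # "12.50"

# A number passed to format() comes back as a string
result3 <- format(6)
print(result3)  # "6"

# scientific = TRUE switches to scientific notation
result4 <- format(12.3456789, scientific = TRUE)
print(result4)  # "1.234568e+01"
```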
DATE AND TIME FORMATTING
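No example survives under this heading. As a minimal sketch, format() also accepts dates and times together with a format string; the date below is an arbitrary example value.

```r
d <- as.Date("2024-03-15")  # example date

format(d, "%d/%m/%Y")  # day/month/year: "15/03/2024"
format(d, "%Y")        # year only: "2024"

# Current time as hours:minutes:seconds (output depends on when it runs)
format(Sys.time(), "%H:%M:%S")
```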
Create two vectors: a <- c(2, 4, 6, 8, 10) and b <- c(1, 3, 5, 7, 9). Compute their sum,
difference, product, and division.
a <- c(2, 4, 6, 8, 10)
b <- c(1, 3, 5, 7, 9)
sum_vec <- a + b
diff_vec <- a - b
prod_vec <- a * b
div_vec <- a / b
print(sum_vec)
print(diff_vec)
print(prod_vec)
print(div_vec)
• Given a vector nums <- c(5, 12, 18, 25, 7, 30, 45), extract elements greater than 20.
nums <- c(5, 12, 18, 25, 7, 30, 45)
filtered_nums <- nums[nums > 20]
print(filtered_nums)
• Count how many times 5 appears in x <- c(1, 5, 3, 5, 7, 5, 9, 5).
x <- c(1, 5, 3, 5, 7, 5, 9, 5)
count_5 <- sum(x == 5)
print(count_5)
• Write an R program to append values to a given empty vector.
vector = c()
values = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
for (i in 1:length(values))
vector[i] <- values[i]
print(vector)
LISTS
• A list in R can contain many different data types inside it. A list is a collection of data which
is ordered and changeable.
• To create a list, use the list() function:
• # List of strings
thislist <- list("apple", "banana", "cherry")
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
print(empList)
print(empList$Names)
ACCESSING COMPONENTS BY INDICES
# Creating a list by naming all its components
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
print(empList)
# Accessing a top level component by indices
cat("Accessing name components using indices\n")
print(empList[[2]])
# Accessing an inner level component by indices
cat("Accessing Sandeep from name using indices\n")
print(empList[[2]][2])
# Accessing another inner level component by indices
cat("Accessing 4 from ID using indices\n")
print(empList[[1]][4])
MODIFYING COMPONENTS OF LIST
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
cat("Before modifying the list\n")
print(empList)
# Modifying inner level components by indices
empList[[1]][5] = 5
empList[[2]][5] = "Kamala"
cat("After modified the list\n")
print(empList)
CONCATENATION OF LIST
thislist[1]
• length(thislist)
• To find out if a specified item is present in a list, use the %in% operator:
• To add an item to the right of a specified index, add "after=index number" in the append() function:
• Add "orange" to the list after "banana" (index 2):
• You can specify a range of indexes by specifying where to start and where to end the range, by using the : operator:
• (thislist)[2:5]
• You can loop through the list items by using a for loop:
thislist <- list("apple", "banana", "cherry")
for (x in thislist) {
  print(x)
}
list3
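Several of the list operations described in this section have lost their examples. A minimal sketch of checking for an item, adding an item after a given index, and concatenating two lists; the contents of list1 and list2 are assumptions made for illustration.

```r
thislist <- list("apple", "banana", "cherry")

# Check whether an item is present
"apple" %in% thislist  # TRUE

# Add "orange" after "banana" (index 2)
thislist <- append(thislist, "orange", after = 2)

# Concatenate two lists with c()
list1 <- list("a", "b", "c")  # hypothetical lists
list2 <- list(1, 2, 3)
list3 <- c(list1, list2)
length(list3)  # 6
```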
QUIZ
Question:
A company maintains an employee database using R lists. The list stores the following
details:
•Employee IDs as a numeric vector (e.g., c(1, 2, 3, 4))
•Employee Names as a character vector (e.g., c("Debi", "Sandeep", "Subham", "Shiba"))
•Total number of employees as a single numeric value
The HR manager wants to perform the following
operations:
1. Retrieve the list of employee names using both name-based and index-based access.
2. Update the names of employees with a new set of names.
3. Merge the employee list with another list containing department and location details.
4. Remove the total employee count from the list.
Write an R program to help the HR manager accomplish these tasks and display the results
accordingly.
# Creating an employee list
empList <- list(
ID = c(1, 2, 3, 4),
Names = c("Debi", "Sandeep", "Subham", "Shiba"),
Total_Staff = 4
)
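Only the list creation is shown for this quiz. One possible way to carry out the four tasks is sketched below; the new names and the department and location details are made-up values, since the question does not specify them.

```r
empList <- list(
  ID = c(1, 2, 3, 4),
  Names = c("Debi", "Sandeep", "Subham", "Shiba"),
  Total_Staff = 4
)

# 1. Retrieve the names by name and by index
print(empList$Names)
print(empList[[2]])

# 2. Update the names (hypothetical new names)
empList$Names <- c("Amit", "Bina", "Chetan", "Divya")

# 3. Merge with a second list (hypothetical department and location details)
deptList <- list(Department = c("HR", "IT", "Finance", "IT"), Location = "Mumbai")
empList <- c(empList, deptList)

# 4. Remove the total employee count by assigning NULL
empList$Total_Staff <- NULL
print(names(empList))
```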
• The whole column can be accessed if you specify a comma before the number in the bracket:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[,2]
• Access More Than One Row
• More than one row can be accessed if you use the c() function:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
newmatrix
Note: The cells in the new column must be of the same length as the existing matrix.
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
newmatrix
Note: The cells in the new row must be of the same length as the existing matrix.
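The lines that create newmatrix are missing from the two examples above. A minimal sketch using cbind() to add a column and rbind() to add a row; the added fruit names are placeholder values.

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
                       "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

# Add a column: the new vector needs one value per existing row
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
dim(newmatrix)  # 3 4

# Add a row: the new vector needs one value per existing column
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
dim(newmatrix)  # 4 3
```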
• Remove Rows and Columns
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
# Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
thismatrix
• To find out if a specified item is present in a matrix, use the %in% operator:
• Matrix Length
• Use the length() function to find the total number of cells in a Matrix (use dim() for the number of rows and columns):
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
length(thismatrix)
• Loop Through a Matrix
• You can loop through a Matrix using a for loop. The loop will start at the first row, moving right:
• Loop through the matrix items and print them:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
for (rows in 1:nrow(thismatrix)) {
  for (columns in 1:ncol(thismatrix)) {
    print(thismatrix[rows, columns])
  }
}
• Combine two Matrices
• Again, you can use the rbind() or cbind() function to combine two or more matrices
# Adding it as a rows
Matrix_Combined
# Adding it as a columns
Matrix_Combined
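The matrices being combined above are not shown. A minimal sketch with two hypothetical 2 x 2 matrices (Matrix1 and Matrix2 are names and values assumed for the example):

```r
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow = 2, ncol = 2)

# Adding it as rows: the second matrix is stacked below the first
Matrix_Combined <- rbind(Matrix1, Matrix2)
dim(Matrix_Combined)  # 4 2

# Adding it as columns: the second matrix is placed to the right of the first
Matrix_Combined <- cbind(Matrix1, Matrix2)
dim(Matrix_Combined)  # 2 4
```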
QUIZ
A hospital maintains a patient monitoring system where vital signs such as heart rate, blood pressure, and oxygen levels are
stored in a matrix. Each row represents a patient, and each column represents a different health metric. The hospital's data
analysts need to perform several operations to manage and analyze the data:
• Add new patients and new health metrics as rows and columns.
• Find the total number of patients and health metrics in the dataset.
• Merge multiple matrices from different hospital branches for consolidated analysis.
patient_data <- matrix(c(80, 120, 98, 75, 130, 95, 90, 140, 88), nrow = 3, byrow = TRUE, dimnames = list(c("Patient1", "Patient2", "Patient3"), c("Heart Rate", "Blood Pressure", "Oxygen Level")))
print("Initial Patient Data:")
print(patient_data)
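# Hypothetical readings for a fourth patient (the original values for the new patient are not shown)
patient_data <- rbind(patient_data, Patient4 = c(85, 125, 97))
temperature <- c(98.6, 99.0, 98.7, 98.5) # Temperature for all four patients
patient_data <- cbind(patient_data, Temperature = temperature)
print("Updated Patient Data with New Patient and Metric:")
print(patient_data)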
# Removing Patient 2 (2nd row) and Blood Pressure (2nd column) using index numbers
patient_data <- patient_data[-2, ]
patient_data <- patient_data[, -2]
print("Data After Removing Patient 2 and 'Blood Pressure' Metric:")
print(patient_data)
critical_values <- patient_data > 100 # Flags every reading above 100, not only heart rate
print("Critical Values (Emergency Cases):")
print(critical_values)
total_patients <- nrow(patient_data)
for (i in 1:total_patients) {
cat(paste("Analyzing Data for", rownames(patient_data)[i], ":\n"))
cat(paste("Heart Rate:", patient_data[i, "Heart Rate"], "\n"))
cat(paste("Oxygen Level:", patient_data[i, "Oxygen Level"], "\n"))
cat(paste("Temperature:", patient_data[i, "Temperature"], "\n"))
cat("\n")
}
branch2_data <- matrix(c(85, 130, 95, 90, 125, 99, 92, 135, 98), nrow = 3, byrow = TRUE, dimnames = list(c("Patient5", "Patient6", "Patient7"),c("Heart Rate", "Blood Pressure", "Oxygen Level")))
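The merge step itself is not shown. rbind() requires both matrices to have the same columns, so one possible approach is to keep only the columns the two branches share. The sketch below uses two small stand-in matrices (branch1 and branch2 are hypothetical names and values) rather than the full data above.

```r
branch1 <- matrix(c(80, 98, 90, 88), nrow = 2, byrow = TRUE,
                  dimnames = list(c("Patient1", "Patient3"),
                                  c("Heart Rate", "Oxygen Level")))
branch2 <- matrix(c(85, 130, 95, 90, 125, 99), nrow = 2, byrow = TRUE,
                  dimnames = list(c("Patient5", "Patient6"),
                                  c("Heart Rate", "Blood Pressure", "Oxygen Level")))

# Keep only the metrics that exist in both branches, then stack the rows
common <- intersect(colnames(branch1), colnames(branch2))
merged <- rbind(branch1[, common], branch2[, common])
print(merged)
```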
• You can access the array elements by referring to the index position. You can use the [] brackets to access the desired elements from an array:
multiarray[2, 3, 2]
• You can also access the whole row or column from a matrix in an array, by using the c() function:
# Access all the items from the first row from matrix one
multiarray[c(1),,1]
• A comma (,) before c() means that we want to access the column.
• A comma (,) after c() means that we want to access the row.
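To make the index order concrete, here is a short sketch using the same 4 x 3 x 2 array of the numbers 1 to 24 that is used later in this section. The three indices are row, column and matrix (layer).

```r
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))

# Row 2, column 3 of the second matrix
multiarray[2, 3, 2]       # 22

# Comma after c(): the whole first row of matrix one
multiarray[c(1), , 1]     # 1 5 9

# Comma before c(): the whole first column of matrix one
multiarray[, c(1), 1]     # 1 2 3 4
```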
Check if an Item Exists
• To find out if a specified item is present in an array, use the %in% operator:
• Check if the value 2 is present in the array:
• thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
2 %in% multiarray
• Number of Rows and Columns
• Use the dim() function to find the number of rows and columns in an array:
• dim(multiarray)
• Array Length
• length(multiarray)
• Loop Through an Array
• You can loop through the array items by using a for loop:
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
for(x in multiarray){
print(x)
}
A library maintains a system to track the availability of books in different genres and their popularity ratings. The library uses an
array to store the data. Each row represents a book, each column represents a different attribute (such as genre and popularity
rating), and each "layer" of the array represents different libraries within a region. The library's data analysts need to perform
several operations to manage and analyze the data:
• Add new books and new attributes (like genre or rating) as rows and columns in the array.
• Check if a book has a critical rating (e.g., below 3, indicating low popularity).
• Loop through the array to analyze each book's details (genre, rating).
• Merge multiple arrays from different library branches for consolidated analysis.
# Values are listed column by column (all genres, then all ratings, per library) because array() fills column-first
book_data <- array(c("Fiction", "Non-fiction", "Mystery", "5", "4", "2", "Sci-Fi", "Fiction", "Biography", "5", "3", "4"), dim = c(3, 2, 2),
dimnames = list(c("Book1", "Book2", "Book3"), c("Genre", "Rating"), c("Library1", "Library2")))
book1_data_library1 <- book_data["Book1", , "Library1"]
print("Book1 Data in Library1 (Genre & Rating):")
print(book1_data_library1)
new_book <- array(c("Romance", "4", "Romance", "4"), dim = c(1, 2, 2), dimnames = list("Book4", c("Genre", "Rating"), c("Library1", "Library2")))
# Bind the new row onto each library's matrix, then rebuild the array
book_data <- array(c(rbind(book_data[, , 1], new_book[, , 1]), rbind(book_data[, , 2], new_book[, , 2])), dim = c(4, 2, 2),
dimnames = list(c("Book1", "Book2", "Book3", "Book4"), c("Genre", "Rating"), c("Library1", "Library2")))
print("Updated Book Data with New Book:")
print(book_data)
# Removing Book2 (2nd row) using its index number
book_data <- book_data[-2, , ]
critical_ratings <- as.numeric(book_data[, "Rating", ]) < 3 # Check for ratings less than 3
print("Books with Critical Ratings (Below 3):")
print(critical_ratings)
total_books <- dim(book_data)[1]
total_attributes <- dim(book_data)[2]
print(paste("Total Books:", total_books))
for (i in 1:total_books) {
  for (j in 1:dim(book_data)[3]) {
    cat(paste("Analyzing", dimnames(book_data)[[1]][i], "Data in", dimnames(book_data)[[3]][j], ":\n"))
    cat(paste("Genre:", book_data[i, "Genre", j], "\n"))
    if (!is.null(dimnames(book_data)[[2]]) && "Rating" %in% dimnames(book_data)[[2]]) {
      cat(paste("Rating:", book_data[i, "Rating", j], "\n"))
    }
    cat("\n")
  }
}
library3_data <- array(c("Fiction", "Non-fiction", "Mystery", "4", "5", "3"), dim = c(3, 2, 1), dimnames = list(c("Book1", "Book3", "Book4"), c("Genre", "Rating"), c("Library3")))
merged_data <- array(c(book_data, library3_data), dim = c(3, 2, 3), dimnames = list(c("Book1", "Book3", "Book4"), c("Genre", "Rating"), c("Library1", "Library2", "Library3")))
• Data Frames can hold different types of data. While the first column can be character, the second and third can be numeric or logical. However, each column should
contain a single type of data.
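• Example
• Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
• Data_Frame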
SUMMARIZE THE DATA
• Use the summary() function to summarize the data from a Data Frame:
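• Example
• Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
• Data_Frame
• summary(Data_Frame)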
ACCESSING ITEMS
• We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame:
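• Example
• Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
• Data_Frame[1]
• Data_Frame[["Training"]]
• Data_Frame$Training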
ADD ROWS
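• Use the rbind() function to add new rows to a Data Frame:
• Example
• Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Add a new row (example values)
New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))
• New_row_DF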
ADD COLUMNS
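• Use the cbind() function to add new columns to a Data Frame:
• Example
• Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Add a new column (example values)
New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))
• New_col_DF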
REMOVE ROWS AND COLUMNS
Use the c() function to remove rows and columns in a Data Frame:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Remove the first row and column
Data_Frame_New <- Data_Frame[-c(1), -c(1)]
Use the dim() function to find the number of rows and columns in a Data Frame:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
dim(Data_Frame)
You can also use the ncol() function to find the number of columns and nrow() to find the number of rows:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
ncol(Data_Frame)
nrow(Data_Frame)
DATA FRAME LENGTH
Use the length() function to find the number of columns in a Data Frame (similar to ncol()):
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
length(Data_Frame)
COMBINING DATA FRAMES
Use the rbind() function to combine two or more data frames in R vertically:
Example
Data_Frame1 <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame3 <- data.frame (
Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
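The second data frame in each example and the combining calls are missing. A minimal sketch of rbind() for stacking rows and cbind() for adding columns side by side; Data_Frame2 and Data_Frame4 and their values are assumptions made for illustration.

```r
Data_Frame1 <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Hypothetical second data frame with the same columns
Data_Frame2 <- data.frame(
  Training = c("Stamina", "Stamina", "Strength"),
  Pulse = c(140, 150, 160),
  Duration = c(30, 30, 20)
)

# Combine vertically: column names must match
New_Data_Frame <- rbind(Data_Frame1, Data_Frame2)
dim(New_Data_Frame)   # 6 3

# Hypothetical data frame with different columns but the same number of rows
Data_Frame4 <- data.frame(
  Steps = c(3000, 6000, 2000),
  Calories = c(300, 400, 300)
)

# Combine horizontally: row counts must match
New_Data_Frame1 <- cbind(Data_Frame1, Data_Frame4)
dim(New_Data_Frame1)  # 3 5
```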
A university maintains a system to track student enrollment and academic performance across different departments. The university uses a
data frame to store the data. Each row represents a student, and each column represents different attributes such as student name,
department, GPA, and enrollment status. The university's data analysts need to perform several operations to manage and analyze the data:
• Retrieve student information by accessing specific rows and columns.
• Add new students and new attributes (e.g., department, GPA) as rows and columns in the data frame.
• Remove students who have graduated or withdrawn from the university.
• Check if any students have a GPA below 2.0 (indicating academic probation).
• Find the total number of students and attributes in the dataset.
• Calculate the total number of records in the data frame.
• Loop through the data frame to analyze each student’s data (name, department, GPA).
• Merge multiple data frames from different departments for consolidated analysis.
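A minimal sketch of how some of these tasks might be approached. The student names, the new student and the StudentID pattern follow the fragments of this exercise; the department, GPA and enrollment values of the first four students are made up for illustration.

```r
student_data <- data.frame(
  StudentID = c(101, 102, 103, 104),
  Name = c("John Doe", "Jane Smith", "Mary Johnson", "Mike Lee"),
  Department = c("Physics", "Biology", "Chemistry", "Physics"),      # assumed values
  GPA = c(3.5, 1.8, 3.2, 2.7),                                       # assumed values
  EnrollmentStatus = c("Active", "Active", "Graduated", "Active")    # assumed values
)

# Add a new student as a row
new_student <- data.frame(StudentID = 105, Name = "Sarah Lee", Department = "Mathematics",
                          GPA = 3.9, EnrollmentStatus = "Active")
student_data <- rbind(student_data, new_student)

# Remove students who are no longer active
student_data <- student_data[student_data$EnrollmentStatus == "Active", ]

# Students on academic probation (GPA below 2.0)
on_probation <- student_data[student_data$GPA < 2.0, "Name"]
print(on_probation)

# Totals, then loop through every student
total_students <- nrow(student_data)
print(paste("Total Students:", total_students))
for (i in seq_len(total_students)) {
  cat(paste("Name:", student_data$Name[i],
            "| Department:", student_data$Department[i],
            "| GPA:", student_data$GPA[i], "\n"))
}
```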
student_data <- data.frame(
Name = c("John Doe", "Jane Smith", "Mary Johnson", "Mike Lee"),
# ...
new_student <- data.frame(StudentID = 105, Name = "Sarah Lee", Department = "Mathematics", GPA = 3.9,
EnrollmentStatus = "Active")
student_data <- rbind(student_data, new_student)
email_data <- data.frame(Email = c("[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"))
# ...
total_students <- nrow(student_data)
print(paste("Total Students:", total_students))
# ...
cat("\n")
}
math_department_data <- data.frame(
StudentID = c(106, 107),
# ...
Example
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
levels(music_genre)
You can also set the levels, by adding the levels argument inside the factor() function:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop
levels(music_genre)
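As an illustration of the levels argument, the sketch below sets the levels explicitly. The exact level list is an assumption: it includes every genre that appears in the data plus a hypothetical extra level, "Other", to show that a level may exist without any items.

```r
music_genre <- factor(
  c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"),
  levels = c("Classic", "Jazz", "Pop", "Rock", "Other")  # "Other" is a hypothetical extra level
)

levels(music_genre)

# table() counts the items per level; "Other" has a count of 0
table(music_genre)
```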
FACTOR LENGTH
Use the length() function to find out how many items there are in the factor:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
length(music_genre)
ACCESS FACTORS
To access the items in a factor, refer to the index number, using [] brackets:
Example
Access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
music_genre[3]
CHANGE ITEM VALUE
music_genre[3]
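The assignment itself is missing from this example. A minimal sketch, assuming the third item is changed to "Pop"; the new value has to be one of the existing levels, otherwise R stores NA and issues a warning.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Change the value of the third item to an existing level
music_genre[3] <- "Pop"
music_genre[3]
```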
SPECIAL VALUES IN R
mean(x)
y <- 0/0
print(y)
z1 <- 1/0
z2 <- -1/0
print(z1)
print(z2)
v <- NULL
length(v)
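The definition of x is not shown above. The sketch below assumes a small vector containing a missing value and walks through the four special values: NA (missing), NaN (not a number), Inf (infinity) and NULL (no object).

```r
x <- c(4, NA, 8)        # hypothetical vector with a missing value
mean(x)                 # NA: any calculation involving NA returns NA
mean(x, na.rm = TRUE)   # 6: ignore the missing value
is.na(x)                # FALSE TRUE FALSE

y <- 0/0
is.nan(y)               # TRUE

z1 <- 1/0
is.infinite(z1)         # TRUE

v <- NULL
is.null(v)              # TRUE
length(v)               # 0
```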
TREATING MISSING VALUES
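No example survives under this heading. As a minimal sketch of two common treatments, the code below either drops rows containing NA or replaces the NA with the column mean. The data frame is made up for illustration.

```r
df <- data.frame(Name = c("A", "B", "C"), Age = c(25, NA, 47))  # hypothetical data

# Count the missing values in a column
sum(is.na(df$Age))       # 1

# Option 1: drop every row that contains an NA
complete <- na.omit(df)
nrow(complete)           # 2

# Option 2: replace the missing value with the mean of the known values
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
df$Age                   # 25 36 47
```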
• The cut() function in R is used to divide continuous numerical values into discrete
categories (also called binning or bucketing). This is useful for creating groups
from numerical data.
• df$AgeGroup <- cut(df$Age, breaks = c(0, 30, 50, Inf),
labels = c("Young", "Middle-aged", "Senior"))
print(df)
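The data frame df is not defined in the text. A self-contained sketch with made-up ages shows how the breaks map to labels. By default each interval includes its upper bound, so an age of exactly 50 falls into "Middle-aged".

```r
df <- data.frame(Name = c("Asha", "Ben", "Chen", "Dev"),
                 Age = c(22, 35, 50, 67))  # hypothetical data

# Intervals: (0, 30], (30, 50], (50, Inf]
df$AgeGroup <- cut(df$Age, breaks = c(0, 30, 50, Inf),
                   labels = c("Young", "Middle-aged", "Senior"))
print(df)
```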
IMPLEMENTING DATA STRUCTURES ON BUILT-IN DATA
SETS
• Vectors in R (1D Homogenous Data)
data("mtcars")
print(mpg_vector)
print(car_names)
print(high_mpg)
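The three vectors printed above are not defined in the text. A minimal sketch of how they might be extracted from mtcars; the threshold of 25 miles per gallon for high_mpg is an assumption.

```r
data("mtcars")

# A numeric vector taken from one column
mpg_vector <- mtcars$mpg
print(mpg_vector)

# A character vector taken from the row names
car_names <- rownames(mtcars)
print(car_names)

# Filter the numeric vector with a condition (threshold assumed)
high_mpg <- mpg_vector[mpg_vector > 25]
print(high_mpg)
```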
• Lists in R
# Creating a list with different components
car_list <- list(
mpg_values = mtcars$mpg, # Numeric vector
car_names = rownames(mtcars), # Character vector
first_five = head(mtcars, 5) # Data frame (first 5 rows)
)
print(car_list)
• Matrices in R (2D Homogenous Data)
# Extracting first 10 rows and 3 numeric columns
car_matrix <- as.matrix(mtcars[1:10, c("mpg", "hp", "wt")])
print(car_matrix)
# Matrix operations
col_means <- colMeans(car_matrix) # Column-wise mean
print(col_means)
print(df)
• Factors in R
# Convert transmission type (0 = Automatic, 1 = Manual) into a factor
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
# Check structure
str(mtcars$am)
# Count occurrences
table(mtcars$am)
PLOT
Example
Draw one point in the diagram, at x-position 1 and y-position 3:
plot(1, 3)
• To draw more points, use vectors:
• Example
• Draw two points in the diagram, one at position (1, 3)
• and one in position (8, 10):
• plot(c(1, 8), c(3, 10))
MULTIPLE POINTS
• You can plot as many points as you like, just make sure you have the same number
of points on both axes:
• Example
• plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
• For better organization, when you have many values, it is better to use variables:
• Example
• x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)
plot(x, y)
DRAW A SEQUENCE OF POINTS
If you want to draw dots in a sequence, on both the x-axis and the y-axis, use the : operator:
Example
plot(1:10)
DRAW A LINE
The plot() function also takes a type parameter with the value l to draw a line to connect all the points in the diagram:
Example
plot(1:10, type="l")
PLOT LABELS
The plot() function also accepts other parameters, such as main, xlab and ylab if you want to customize the
graph with a main title and different labels for the x and y-axis:
Example
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
GRAPH APPEARANCE
There are many other parameters you can use to change the appearance of the points.
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
SIZE
Use cex=number to change the size of the points (1 is default, while 0.5 means 50% smaller, and 2 means
100% larger):
Example
plot(1:10, cex=2)
POINT SHAPE
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)
LINE GRAPH
A line graph has a line that connects all the points in a diagram.
To create a line, use the plot() function and add the type parameter with a value of "l":
Example
plot(1:10, type="l")
LINE COLOR
The line color is black by default. To change the color, use the col parameter:
Example
plot(1:10, type="l", col="blue")
LINE WIDTH
To change the width of the line, use the lwd parameter (1 is default, while 0.5 means 50% smaller, and 2 means 100% larger):
Example
plot(1:10, type="l", lwd=2)
LINE STYLES
The line is solid by default. Use the lty parameter with a value from 0 to 6 to
specify the line format.
For example, lty=3 will display a dotted line instead of a solid line:
Example
plot(1:10, type="l", lwd=5, lty=3)
To display more than one line in a graph, use the plot() function together
with the lines() function:
Example
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
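The plotting calls for this example are missing. A minimal sketch: plot() draws the first line and sets up the axes, then lines() adds the second line to the same graph. The colors are chosen only to tell the lines apart.

```r
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)

# Draw the first line, then add the second one to the same plot
plot(line1, type = "l", col = "blue")
lines(line2, type = "l", col = "red")
```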
You learned from the Plot chapter that the plot() function is used to plot numbers
against each other.
A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
It needs two vectors of same length, one for the x-axis (horizontal) and one for the
y-axis (vertical):
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
• The observations in the example above represent 12 cars passing by.
• That might not be clear to someone who sees the graph for the first time, so let's add a header and
different labels to describe the scatter plot better:
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
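The vectors above still need a plot() call with a header and axis labels. A possible version is sketched below. It assumes that x holds the car age and y the car speed, as the comparison example suggests, and the title text is made up for illustration.

```r
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)             # car age
y <- c(99,86,87,88,111,103,87,94,78,77,85,86) # car speed

# main sets the header, xlab and ylab set the axis labels
plot(x, y, main = "Observation of Cars", xlab = "Car age", ylab = "Car speed")
```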
In the example above, there seems to be a relationship between the car speed and
age, but what if we plot the observations from another day as well? Will the scatter
plot tell us something else?
To compare the plot with another plot, use the points() function:
Example
Draw two plots on the same figure:
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
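The example above shows only the first day. A sketch of the full comparison might look as follows. The day two values (x2 and y2) are hypothetical numbers invented for this illustration, and the colors, point size and title are assumptions as well.

```r
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)

# day two: hypothetical values, invented for this illustration
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)

# Set the axis ranges so that the observations of both days fit in the figure
plot(x1, y1, main = "Observation of Cars", xlab = "Car age", ylab = "Car speed",
     col = "red", cex = 2, xlim = range(x1, x2), ylim = range(y1, y2))
# points() adds the second day to the existing plot
points(x2, y2, col = "blue", cex = 2)
```

Using different colors for the two days makes it easier to see which observation belongs to which day.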
PIE CHARTS
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
As you can see, the pie chart draws one pie for each value in the vector (in this
case 10, 20, 30, 40).
By default, the plotting of the first pie starts from the x-axis and
moves counterclockwise.
Note: The size of each pie is determined by comparing the value with all the
other values, by using this formula:
The value divided by the sum of all values: x/sum(x)
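The formula can be checked directly in R. The short sketch below computes the share of each value and uses the rounded percentages as labels; the variable name shares is chosen only for this example.

```r
x <- c(10,20,30,40)

# Share of each value in the total: x/sum(x)
shares <- x / sum(x)
print(shares)  # 0.1 0.2 0.3 0.4

# Show the shares as percentage labels on the pie chart
pie(x, labels = paste0(round(shares * 100), "%"))
```

The shares always add up to 1, which is why the pies together fill the whole circle.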
START ANGLE
You can change the start angle of the pie chart with the init.angle parameter.
The value of init.angle is given in degrees, and the default angle is 0.
Example
Start the first pie at 90 degrees:
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)
Use the labels parameter to add a label to each pie, and use
the main parameter to add a header:
Example
# Create a vector of pies
x <- c(10,20,30,40)
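The example above ends before the chart is drawn. One way to finish it is sketched here. The label vector is the one that appears in the legend example of this section, and the header text "Fruits" is an assumption.

```r
# Create a vector of pies
x <- c(10,20,30,40)
# Create a vector of labels, one for each pie
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")

# labels names each pie, main adds the header
pie(x, labels = mylabel, main = "Fruits")
```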
You can add a color to each pie with the col parameter:
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors (x is the vector of pies created earlier)
pie(x, col = colors)
To add a list of explanations for each pie, use the legend() function:
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
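Putting labels, colors and legend together, a complete example could look like the sketch below. The header text and the legend position "bottomright" are assumptions; legend() also accepts other position keywords such as "topleft" or "right".

```r
# Create a vector of pies, labels and colors
x <- c(10,20,30,40)
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
colors <- c("blue", "yellow", "green", "black")

# Display the pie chart with colors
pie(x, labels = mylabel, main = "Pie Chart", col = colors)

# Add the legend: position, texts, and the fill color of each box
legend("bottomright", legend = mylabel, fill = colors)
```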
BAR CHARTS
A bar chart uses rectangular bars to visualize data. Bar charts can be displayed
horizontally or vertically. The height or length of each bar is proportional to the
value it represents.
Use the barplot() function to draw a vertical bar chart:
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x)
If you want the bars to be displayed horizontally instead of vertically, use horiz=TRUE:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)
Example
# Print the mtcars data set
mtcars
INFORMATION ABOUT DATA SET
You can use the question mark (?) to get information about the mtcars data set:
Example
# Use the question mark to get information about the data set
?mtcars
GET INFORMATION
Use the dim() function to find the dimensions of the data set, and the names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better organization
dim(Data_Cars) # dimensions of the data set (rows, columns)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Use the rownames() function to get the name of each row in the first column, which is the name of each car:
Example
Data_Cars <- mtcars
rownames(Data_Cars)
If you want to print all values that belong to a variable, access the data frame by using the $ sign and the name of the variable (for example cyl, the number of cylinders):
Example
Data_Cars <- mtcars
Data_Cars$cyl
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
ANALYZING THE DATA
Now that we have some information about the data set, we can start to analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical summary of
the data:
Example
Data_Cars <- mtcars
summary(Data_Cars)
MEAN
mean(Data_Cars$wt)
MEDIAN
median(Data_Cars$wt)
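To see what mean() and median() compute, both values can be recalculated by hand. This is only a sketch for checking the results; the names manual_mean and manual_median are chosen for this example.

```r
wt <- mtcars$wt

# Mean: the sum of all values divided by the number of values
manual_mean <- sum(wt) / length(wt)

# Median: the middle value of the sorted data
# (with an even number of values, the average of the two middle values)
sorted_wt <- sort(wt)
n <- length(sorted_wt)
manual_median <- if (n %% 2 == 0) {
  (sorted_wt[n / 2] + sorted_wt[n / 2 + 1]) / 2
} else {
  sorted_wt[(n + 1) / 2]
}

print(c(manual_mean, mean(wt)))
print(c(manual_median, median(wt)))
```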
MODE
• The mode is the value that appears most often.
• R does not have a built-in function to calculate the statistical mode. However, we can write our own
expression to find it.
Example
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
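The one-line expression above returns the mode as a character string. If a reusable function is preferred, a small helper can be written as sketched below. The name get_mode is hypothetical and not part of R, and when several values are tied the helper returns only the first one it finds.

```r
# Hypothetical helper: returns the most frequent value of a vector
get_mode <- function(v) {
  uniq <- unique(v)                    # the distinct values
  counts <- tabulate(match(v, uniq))   # how often each distinct value occurs
  uniq[which.max(counts)]              # the value with the highest count
}

Data_Cars <- mtcars
get_mode(Data_Cars$wt)
```

Unlike the table() approach, this helper keeps the original data type, so the result for wt is a number and not a string.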
PERCENTILES
• Percentiles are used in statistics to give you a number that describes the value that
a given percent of the values are lower than.
Example
Data_Cars <- mtcars
quantile(Data_Cars$wt)
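Called without further arguments, quantile() returns the 0, 25, 50, 75 and 100 percentiles. To ask for a specific percentile, the probs argument can be used, as in this short sketch; the variable name q75 is chosen for illustration.

```r
Data_Cars <- mtcars

# The 75th percentile of the car weights
q75 <- quantile(Data_Cars$wt, probs = 0.75)
print(q75)

# Several percentiles at once
quantile(Data_Cars$wt, probs = c(0.10, 0.50, 0.90))
```

A value of 0.75 means that about 75% of the cars weigh less than the returned number.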