
B.M.S.

COLLEGE OF ENGINEERING
(Autonomous College under VTU)
Bull Temple Road, Basavangudi, Bangalore - 560019

Lab Observation

on

Programming with R

Submitted by

Mayur Shridhar Hegde (1BM22CD040)

in fulfillment of the mandatory observation submission for Lab assessment

BACHELOR OF ENGINEERING
in
Computer Science & Engineering (Data Science)

Under the Guidance of

Dr. Kalyan N
Assistant Professor
Department of CSE (Data Science),
B.M.S. College of Engineering

2024-2025
TABLE OF CONTENTS

Sl No  Program                                                                     Date of Execution
1      Arithmetic Operations, Variable Assignment, and Conditional Statements      18/10/2024
2      Creating and Manipulating Data Structures                                   25/10/2024
3      Basic Statistical Operations on Open-Source Datasets                        25/10/2024
4      Data Import, Cleaning, and Export with Advanced Data Wrangling              08/11/2024
5      Advanced Data Manipulation with dplyr and Complex Grouping                  08/11/2024
6      Data Visualization with ggplot2 and Customizations                          15/11/2024
7      Linear and Multiple Regression Analysis with Interaction Terms              22/11/2024
8      K-Means Clustering and PCA for Dimensionality Reduction                     22/11/2024
9      Time Series Analysis using ARIMA and Seasonal Decomposition                 20/12/2024
10     Interactive Visualization with plotly and Dynamic Reports with RMarkdown    20/12/2024

Program - 1

Arithmetic Operations, Variable Assignment,

and Conditional Statements

DATE - 18/10/2024

1. Introduction

This R program determines whether three given side lengths form a valid triangle. If valid, the program
identifies the type of triangle (Equilateral, Isosceles, or Scalene) and calculates its area using Heron’s formula.
This process involves input validation, application of geometric principles, and computation techniques, all
explained step by step.

2. Code Explanation

2.1 Function Definitions

Valid Triangle Check

The is_valid_triangle function verifies if the given side lengths satisfy the triangle inequality theorem, a
fundamental condition for forming a triangle.

is_valid_triangle <- function(a, b, c) {


return((a + b > c) & (b + c > a) & (a + c > b))
}

• Input: Side lengths a, b, and c.


• Output: TRUE if the sides form a valid triangle, otherwise FALSE.

Triangle Type

The triangle_type function identifies the type of triangle based on its side lengths.

triangle_type <- function(a, b, c) {


if (a == b && b == c) {
return("Equilateral")
} else if (a == b || b == c || a == c) {
return("Isosceles")
} else {
return("Scalene")
}
}

• Logic:
– Equilateral: All sides are equal.
– Isosceles: Two sides are equal.
– Scalene: All sides are different.

Area Calculation

The triangle_area function calculates the area of a triangle using Heron’s formula.

triangle_area <- function(a, b, c) {


s <- (a + b + c) / 2 # Semi-perimeter
area <- sqrt(s * (s - a) * (s - b) * (s - c))
return(area)
}

• Input: Side lengths a, b, and c.


• Output: Area of the triangle.
• Formula:
– Semi-perimeter: s = (a + b + c) / 2
– Area: sqrt(s * (s - a) * (s - b) * (s - c))
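For example, with the 3-4-5 triangle used later in this program, s = (3 + 4 + 5) / 2 = 6 and the area is sqrt(6 * 3 * 2 * 1) = 6.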

Input Validation

The validate_input function ensures that each input is a positive numeric value.

validate_input <- function(x) {


if (!is.numeric(x) || x <= 0) {
stop("Error: Input must be a positive number.")
}
return(TRUE)
}

• Logic:
– Non-numeric or non-positive values trigger an error.
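
As a quick usage sketch (illustrative values, not part of the original program), a valid value returns TRUE while an invalid one stops with the error:

validate_input(5)      # returns TRUE
# validate_input(-2)   # would stop with: Error: Input must be a positive number.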

2.2 Main Code Block

This block integrates the functions to perform the required operations interactively.

# cat("Enter the lengths of the sides of the triangle:\n")


# a <- as.numeric(readline(prompt = "Side a: "))
# b <- as.numeric(readline(prompt = "Side b: "))
# c <- as.numeric(readline(prompt = "Side c: "))
# The above is the code to take input from user
# But here default values are used
# Define the lengths of the sides
a <- 3 # Side a
b <- 4 # Side b
c <- 5 # Side c

# Print the values


cat("Side a:", a, "\n")

## Side a: 3

cat("Side b:", b, "\n")

## Side b: 4

cat("Side c:", c, "\n")

## Side c: 5

# Input validation and triangle processing


tryCatch({
validate_input(a)
validate_input(b)
validate_input(c)

# Check if the inputs form a valid triangle


if (!is_valid_triangle(a, b, c)) {
stop("Error: The given sides do not form a valid triangle.")
}

# Determine the type of triangle


type_of_triangle <- triangle_type(a, b, c)
cat("The triangle is:", type_of_triangle, "\n")

# Calculate the area of the triangle


area_of_triangle <- triangle_area(a, b, c)
cat("The area of the triangle is:", area_of_triangle, "\n")
}, error = function(e) {
cat(e$message, "\n")
})

## The triangle is: Scalene


## The area of the triangle is: 6

• Input Handling: The program can prompt the user for the three side lengths and convert them to
numeric values; in this run the prompts are commented out and the default values 3, 4, and 5 are used.
• Error Handling: The tryCatch block handles errors gracefully, ensuring that the program does not
crash due to invalid input or failed validations. If an error is encountered, a meaningful message is
displayed.
• Sequential Processing:
1. Each side length is validated using the validate_input function.
2. The validity of the triangle is checked with the is_valid_triangle function.
3. If valid, the type of the triangle is determined and displayed.
4. Finally, the area of the triangle is computed and printed.
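
For illustration, a minimal sketch with hypothetical side lengths 1, 2, and 10 (which violate the triangle inequality) shows how the same tryCatch block reports the failure:

tryCatch({
  validate_input(1)
  validate_input(2)
  validate_input(10)
  if (!is_valid_triangle(1, 2, 10)) {
    stop("Error: The given sides do not form a valid triangle.")
  }
}, error = function(e) {
  cat(e$message, "\n")   # prints: Error: The given sides do not form a valid triangle.
})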

3. Conclusion

This R program effectively validates triangle properties, identifies its type, and calculates its area using
well-structured functions. Each step follows mathematical and computational principles, ensuring accurate
results. This program is a practical tool for triangle analysis and can be extended for educational or practical
purposes.

Program - 2

Creating and Manipulating Data Structures

DATE - 25/10/2024

1. Introduction

This document demonstrates various operations in R, including vector manipulation, matrix operations, list
handling, and data frame analysis. The primary aim was to perform several data manipulations using R’s
built-in functions and packages.

2. Vector Operations

Creating a Random Vector

The first step is to generate a random vector using the runif() function, which creates uniformly distributed
random numbers. This vector consists of 20 random values between 1 and 100.

set.seed(42)
random_vector <- runif(20, min = 1, max = 100)
cat("Random vector created:\n", head(random_vector,8))

## Random vector created:


## 91.5658 93.77047 29.32781 83.21431 64.53281 52.3905 73.92224 14.33199

Sorting the Vector

Next, the vector is sorted in ascending order using the sort() function. Sorting helps to organize the data
and identify trends or outliers.

sorted_vector <- sort(random_vector)


cat("Sorted vector:\n", head(sorted_vector))

## Sorted vector:
## 12.63125 14.33199 26.28745 29.32781 46.31644 46.76699
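
If a descending order were needed instead, sort() accepts a decreasing argument:

sorted_desc <- sort(random_vector, decreasing = TRUE)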

Searching for a Value

A specific value (50) is searched for within the random vector. The any() function is used to check if the
value exists in the vector.

search_value <- 50
is_value_present <- any(random_vector == search_value)
cat("Is", search_value, "present in the vector? ", is_value_present)

## Is 50 present in the vector? FALSE

Filtering Values Greater Than 60
Values greater than 60 are filtered from the vector using logical indexing. Filtering is helpful to subset data
based on certain conditions.

values_greater_than_60 <- random_vector[random_vector > 60]


cat("Values greater than 60:\n", head(values_greater_than_60))

## Values greater than 60:


## 91.5658 93.77047 83.21431 64.53281 73.92224 66.04224

3. Matrix Operations

Creating a Matrix
The random vector is reshaped into a 4x5 matrix using the matrix() function. This matrix format allows
for more complex data operations.

matrix_from_vector <- matrix(random_vector, nrow = 4, ncol = 5)


cat("Matrix created from the vector:\n")

## Matrix created from the vector:

print(matrix_from_vector)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 91.56580 64.53281 66.04224 93.53255 97.84442
## [2,] 93.77047 52.39050 70.80141 26.28745 12.63125
## [3,] 29.32781 73.92224 46.31644 46.76699 48.02471
## [4,] 83.21431 14.33199 72.19211 94.06144 56.47294

Matrix Transpose and Multiplication


The matrix is transposed using the t() function, and a matrix multiplication is performed using the %*%
operator. These operations are important in linear algebra for solving systems of equations.

matrix_transpose <- t(matrix_from_vector)


matrix_multiplication_result <- matrix_from_vector %*% matrix_transpose
cat("Matrix multiplication result:\n")

## Matrix multiplication result:

print(matrix_multiplication_result)

## [,1] [,2] [,3] [,4]


## [1,] 35232.22 20337.59 19587.86 27635.57
## [2,] 20337.59 17401.08 11738.17 16851.17
## [3,] 19587.86 11738.17 12963.36 13954.70
## [4,] 27635.57 16851.17 13954.70 24378.48

Element-wise Matrix Multiplication


Element-wise multiplication is performed by multiplying the matrix by itself. This operation computes the
product for each corresponding element.

elementwise_multiplication_result <- matrix_from_vector * matrix_from_vector
cat("Element-wise multiplication result:\n")

## Element-wise multiplication result:

print(elementwise_multiplication_result)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 8384.2954 4164.483 4361.577 8748.3384 9573.5298
## [2,] 8792.9003 2744.764 5012.840 691.0302 159.5484
## [3,] 860.1207 5464.498 2145.212 2187.1513 2306.3729
## [4,] 6924.6222 205.406 5211.701 8847.5541 3189.1932

4. List Operations

Creating a List

A list is created combining various data types, including numeric vectors, characters, logical values, and
matrices. Lists in R can hold elements of different types, unlike vectors and matrices.

my_list <- list(


numbers = random_vector,
characters = c("A", "B", "C", "D"),
logical_values = c(TRUE, FALSE, TRUE),
matrix = matrix_from_vector
)
cat("List created:\n")

## List created:

print(my_list)

## $numbers
## [1] 91.56580 93.77047 29.32781 83.21431 64.53281 52.39050 73.92224 14.33199
## [9] 66.04224 70.80141 46.31644 72.19211 93.53255 26.28745 46.76699 94.06144
## [17] 97.84442 12.63125 48.02471 56.47294
##
## $characters
## [1] "A" "B" "C" "D"
##
## $logical_values
## [1] TRUE FALSE TRUE
##
## $matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 91.56580 64.53281 66.04224 93.53255 97.84442
## [2,] 93.77047 52.39050 70.80141 26.28745 12.63125
## [3,] 29.32781 73.92224 46.31644 46.76699 48.02471
## [4,] 83.21431 14.33199 72.19211 94.06144 56.47294

Subsetting the List

Specific elements are extracted from the list using $ notation. This allows accessing individual components
of the list.

subset_numeric <- my_list$numbers
cat("Numeric subset of the list:\n", head(subset_numeric),"\n")

## Numeric subset of the list:


## 91.5658 93.77047 29.32781 83.21431 64.53281 52.3905

subset_logical <- my_list$logical_values


cat("Logical subset of the list:\n", subset_logical)

## Logical subset of the list:


## TRUE FALSE TRUE

Modifying List Elements

The second character in the character vector of the list is modified to “Z”. This shows how to update elements
within a list.

my_list$characters[2] <- "Z"


cat("Modified list of characters:\n", my_list$characters)

## Modified list of characters:


## A Z C D

Squaring the Numbers

The numbers in the list are squared using the ^ operator to demonstrate vectorized operations in R, where
each element is squared individually.

squared_numbers <- my_list$numbers^2


cat("Squared numbers:\n", head(squared_numbers))

## Squared numbers:
## 8384.295 8792.9 860.1207 6924.622 4164.483 2744.764

5. Data Frame Operations

Creating a Data Frame

A data frame is created to represent structured tabular data. This data frame contains columns for ID, age,
score, and pass/fail status.

df <- data.frame(
ID = 1:20,
Age = sample(18:65, 20, replace = TRUE),
Score = runif(20, min = 50, max = 100),
Passed = sample(c(TRUE, FALSE), 20, replace = TRUE)
)
cat("Data frame created:\n")

## Data frame created:

print(df)

## ID Age Score Passed


## 1 1 64 71.78858 FALSE
## 2 2 20 51.87155 FALSE
## 3 3 58 98.67700 FALSE
## 4 4 42 71.58756 FALSE
## 5 5 44 97.87883 FALSE
## 6 6 53 94.38775 FALSE
## 7 7 54 81.99894 TRUE
## 8 8 48 98.54833 FALSE
## 9 9 62 80.94191 TRUE
## 10 10 22 66.67136 FALSE
## 11 11 37 67.33741 TRUE
## 12 12 51 69.92427 FALSE
## 13 13 45 89.23464 FALSE
## 14 14 57 51.94682 FALSE
## 15 15 20 87.43977 FALSE
## 16 16 50 83.86384 TRUE
## 17 17 59 58.56322 TRUE
## 18 18 41 63.05440 TRUE
## 19 19 47 75.72065 TRUE
## 20 20 60 83.78036 FALSE

Filtering the Data Frame

The data frame is filtered to extract rows where both Age > 30 and Score > 70. This filtering is essential
for focusing on specific subgroups of data.

filtered_df <- subset(df, Age > 30 & Score > 70)


cat("Filtered data frame:\n")

## Filtered data frame:

print(filtered_df)

## ID Age Score Passed


## 1 1 64 71.78858 FALSE
## 3 3 58 98.67700 FALSE
## 4 4 42 71.58756 FALSE
## 5 5 44 97.87883 FALSE
## 6 6 53 94.38775 FALSE
## 7 7 54 81.99894 TRUE
## 8 8 48 98.54833 FALSE
## 9 9 62 80.94191 TRUE
## 13 13 45 89.23464 FALSE
## 16 16 50 83.86384 TRUE
## 19 19 47 75.72065 TRUE
## 20 20 60 83.78036 FALSE

Summary Statistics

Summary statistics (mean, sum, and variance) are calculated for both the Age and Score columns. These
statistics are useful for understanding the central tendency and variability of the data.

mean_age <- mean(df$Age)
sum_age <- sum(df$Age)
var_age <- var(df$Age)

mean_score <- mean(df$Score)


sum_score <- sum(df$Score)
var_score <- var(df$Score)

cat("Summary statistics for Age:\n")

## Summary statistics for Age:

cat("Mean:", mean_age, "Sum:", sum_age, "Variance:", var_age, "\n")

## Mean: 46.7 Sum: 934 Variance: 179.6947

cat("Summary statistics for Score:\n")

## Summary statistics for Score:

cat("Mean:", mean_score, "Sum:", sum_score, "Variance:", var_score, "\n")

## Mean: 77.26086 Sum: 1545.217 Variance: 219.2162

Handling Missing Values


To simulate real-world data, missing values are introduced in the Score column. These missing values are
then imputed using the column’s mean.

df$Score[sample(1:20, 5)] <- NA


cat("Data frame with missing values:\n")

## Data frame with missing values:

print(df)

## ID Age Score Passed


## 1 1 64 71.78858 FALSE
## 2 2 20 51.87155 FALSE
## 3 3 58 98.67700 FALSE
## 4 4 42 NA FALSE
## 5 5 44 97.87883 FALSE
## 6 6 53 94.38775 FALSE
## 7 7 54 81.99894 TRUE
## 8 8 48 98.54833 FALSE
## 9 9 62 NA TRUE
## 10 10 22 NA FALSE
## 11 11 37 67.33741 TRUE
## 12 12 51 NA FALSE
## 13 13 45 NA FALSE
## 14 14 57 51.94682 FALSE
## 15 15 20 87.43977 FALSE
## 16 16 50 83.86384 TRUE
## 17 17 59 58.56322 TRUE
## 18 18 41 63.05440 TRUE
## 19 19 47 75.72065 TRUE
## 20 20 60 83.78036 FALSE

df$Score[is.na(df$Score)] <- mean(df$Score, na.rm = TRUE)
cat("Data frame after handling missing values:\n")

## Data frame after handling missing values:

print(df)

## ID Age Score Passed


## 1 1 64 71.78858 FALSE
## 2 2 20 51.87155 FALSE
## 3 3 58 98.67700 FALSE
## 4 4 42 77.79050 FALSE
## 5 5 44 97.87883 FALSE
## 6 6 53 94.38775 FALSE
## 7 7 54 81.99894 TRUE
## 8 8 48 98.54833 FALSE
## 9 9 62 77.79050 TRUE
## 10 10 22 77.79050 FALSE
## 11 11 37 67.33741 TRUE
## 12 12 51 77.79050 FALSE
## 13 13 45 77.79050 FALSE
## 14 14 57 51.94682 FALSE
## 15 15 20 87.43977 FALSE
## 16 16 50 83.86384 TRUE
## 17 17 59 58.56322 TRUE
## 18 18 41 63.05440 TRUE
## 19 19 47 75.72065 TRUE
## 20 20 60 83.78036 FALSE

Grouped Statistics

Using the dplyr package, statistics for Age and Score are calculated grouped by the Passed status. This
approach allows for understanding the differences in data across categories.

library(dplyr)
grouped_stats <- df %>%
group_by(Passed) %>%
summarise(
mean_score = mean(Score, na.rm = TRUE),
mean_age = mean(Age)
)
cat("Grouped statistics by Passed status:\n")

## Grouped statistics by Passed status:

print(grouped_stats)

## # A tibble: 2 x 3
## Passed mean_score mean_age
## <lgl> <dbl> <dbl>
## 1 FALSE 80.6 44.9
## 2 TRUE 72.6 50

6. Conclusion

This analysis covers essential operations on vectors, matrices, lists, and data frames in R. The operations
performed include sorting, searching, filtering, subsetting, and summarizing data. These methods are
foundational for data analysis and statistical modeling in R.

Program - 3

Basic Statistical Operations on Open-Source

Datasets

DATE - 25/10/2024

1. Introduction

This analysis performs statistical analysis on two datasets: the Iris dataset and the Palmer Penguins dataset.
Various statistical metrics like mean, median, mode, variance, standard deviation, skewness, and kurtosis are
calculated. Hypothesis testing and data visualization are also performed to explore these datasets further.
The following libraries are loaded:
‘dplyr’ for data manipulation, ‘ggplot2’ for creating visualizations, ‘moments’ for calculating skewness and
kurtosis, ‘palmerpenguins’ for the Palmer Penguins dataset.

library(dplyr)
library(ggplot2)
library(moments)
library(palmerpenguins)

2. Load Datasets

The ‘iris’ and ‘penguins’ datasets are loaded. The ‘iris’ dataset comes from R’s built-in datasets and contains
measurements of sepal and petal lengths and widths for different iris species. The ‘penguins’ dataset provides
measurements of flipper length, body mass, and other characteristics for three penguin species.

data(iris)
data(penguins)

3. Function to Calculate Mode

The mode is defined as the most frequently occurring value in a dataset. A custom function ‘calc_mode()’
is defined to calculate the mode by sorting the frequency of the elements and returning the most frequent
one.

calc_mode <- function(x) {


return(as.numeric(names(sort(table(x), decreasing = TRUE))[1]))
}
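
As a small illustration with arbitrary values, the function returns the most frequent element:

calc_mode(c(2, 5, 5, 7, 5, 2))   # returns 5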

4. Statistical Analysis on the Iris Dataset

Mean

The mean is calculated for each of the numeric columns in the ‘iris’ dataset using the ‘sapply()’ function.
The ‘na.rm = TRUE’ argument ensures that missing values are ignored during the calculation.

iris_mean <- sapply(iris[, 1:4], mean, na.rm = TRUE)
print(paste("Mean of Iris dataset: ", iris_mean))

## [1] "Mean of Iris dataset: 5.84333333333333"


## [2] "Mean of Iris dataset: 3.05733333333333"
## [3] "Mean of Iris dataset: 3.758"
## [4] "Mean of Iris dataset: 1.19933333333333"

This code calculates the mean of the sepal length, sepal width, petal length, and petal width for all species
in the ‘iris’ dataset

Median

The median is calculated similarly to the mean. The median is the middle value when the data is sorted. If
the dataset has an even number of values, the average of the two middle values is taken.

iris_median <- sapply(iris[, 1:4], median, na.rm = TRUE)


print(paste("Median of Iris dataset: ", iris_median))

## [1] "Median of Iris dataset: 5.8" "Median of Iris dataset: 3"


## [3] "Median of Iris dataset: 4.35" "Median of Iris dataset: 1.3"

This code calculates the median for the four numeric columns in the ‘iris’ dataset.
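
As a quick illustration of the even-count case, median(c(2, 4, 6, 8)) returns 5, the average of the two middle values.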

Mode

The mode is calculated for each numeric column using the ‘calc_mode()’ function defined earlier.

iris_mode <- sapply(iris[, 1:4], calc_mode)


print(paste("Mode of Iris dataset: ", iris_mode))

## [1] "Mode of Iris dataset: 5" "Mode of Iris dataset: 3"


## [3] "Mode of Iris dataset: 1.4" "Mode of Iris dataset: 0.2"

This code calculates the mode for sepal length, sepal width, petal length, and petal width.

Variance

Variance measures the spread of data points. The ‘var()’ function is used to compute variance for each
numeric column in the dataset.

iris_variance <- sapply(iris[, 1:4], var, na.rm = TRUE)


print(paste("Variance of Iris dataset: ", iris_variance))

## [1] "Variance of Iris dataset: 0.685693512304251"


## [2] "Variance of Iris dataset: 0.189979418344519"
## [3] "Variance of Iris dataset: 3.11627785234899"
## [4] "Variance of Iris dataset: 0.581006263982103"

This code computes the variance for the four numeric columns in the ‘iris’ dataset.

Standard Deviation

The standard deviation is the square root of the variance and provides a measure of the spread of data
points. It is calculated using the ‘sd()’ function.

iris_sd <- sapply(iris[, 1:4], sd, na.rm = TRUE)


print(paste("Standard Deviation of Iris dataset: ", iris_sd))

## [1] "Standard Deviation of Iris dataset: 0.828066127977863"


## [2] "Standard Deviation of Iris dataset: 0.435866284936698"
## [3] "Standard Deviation of Iris dataset: 1.76529823325947"
## [4] "Standard Deviation of Iris dataset: 0.762237668960347"

This code computes the standard deviation for each numeric column in the ‘iris’ dataset.

Skewness

Skewness measures the asymmetry of the distribution of data. The ‘skewness()’ function from the ‘moments’
package is used to compute skewness.

iris_skewness <- sapply(iris[, 1:4], skewness, na.rm = TRUE)


print(paste("Skewness of Iris dataset: ", iris_skewness))

## [1] "Skewness of Iris dataset: 0.311753058502296"


## [2] "Skewness of Iris dataset: 0.315767106338938"
## [3] "Skewness of Iris dataset: -0.272127666456721"
## [4] "Skewness of Iris dataset: -0.101934206565599"

This code calculates the skewness for the numeric columns in the ‘iris’ dataset.

Kurtosis

Kurtosis measures the “tailedness” of the distribution. The ‘kurtosis()’ function is used to compute kurtosis
for each numeric column in the dataset.

iris_kurtosis <- sapply(iris[, 1:4], kurtosis, na.rm = TRUE)


print(paste("Kurtosis of Iris dataset: ", iris_kurtosis))

## [1] "Kurtosis of Iris dataset: -0.573567948924977"


## [2] "Kurtosis of Iris dataset: 0.180976317522469"
## [3] "Kurtosis of Iris dataset: -1.39553588639901"
## [4] "Kurtosis of Iris dataset: -1.33606740523155"

This code calculates the kurtosis for the four numeric columns in the ‘iris’ dataset.

5. Hypothesis Testing

t-test for Sepal Length between Setosa and Versicolor Species

A t-test is performed to compare the Sepal Length between the Setosa and Versicolor species. The ‘t.test()’
function is used to test if the means of two independent groups are different.

setosa <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t_test <- t.test(setosa, versicolor)
print(t_test)

##
## Welch Two Sample t-test
##
## data: setosa and versicolor
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y
## 5.006 5.936

This code compares the Sepal Length of Setosa and Versicolor species using a two-sample t-test.
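
If individual components of the result are needed later, the object returned by ‘t.test()’ can be indexed directly, for example:

t_test$p.value    # p-value of the test
t_test$conf.int   # 95 percent confidence interval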

6. Visualization of Iris Dataset

Histogram of Sepal Length


A histogram is created to visualize the distribution of Sepal Length. The ‘geom_histogram()’ function from
‘ggplot2’ is used to plot the histogram with a bin width of 0.3.

ggplot(iris, aes(x = Sepal.Length)) +


geom_histogram(binwidth = 0.3, fill = "blue", color = "black") +
ggtitle("Histogram of Sepal Length in Iris Dataset")

[Figure: Histogram of Sepal Length in the Iris dataset (x = Sepal.Length, y = count)]

This code creates a histogram to show the distribution of Sepal Length in the ‘iris’ dataset.

Boxplot of Sepal Length by Species

A boxplot is created to show the variation in Sepal Length for each species. The ‘geom_boxplot()’ function
is used to plot the boxplot.

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +


geom_boxplot() +
ggtitle("Boxplot of Sepal Length by Species in Iris Dataset")

[Figure: Boxplot of Sepal Length by Species in the Iris dataset (setosa, versicolor, virginica)]

This code creates a boxplot to visualize the Sepal Length distribution across different species in the ‘iris’
dataset.

7. Statistical Analysis on the Palmer Penguins Dataset

Data Cleaning

The ‘na.omit()’ function is used to remove rows with missing values in the ‘penguins’ dataset.

penguins_clean <- na.omit(penguins)

This code removes rows containing missing values in the ‘penguins’ dataset.

Mean

The mean of the numeric columns (flipper length, body mass, etc.) in the ‘penguins’ dataset is calculated.

penguins_mean <- sapply(penguins_clean[, 3:6], mean, na.rm = TRUE)
print(paste("Mean of Palmer Penguins dataset: ", penguins_mean))

## [1] "Mean of Palmer Penguins dataset: 43.9927927927928"


## [2] "Mean of Palmer Penguins dataset: 17.1648648648649"
## [3] "Mean of Palmer Penguins dataset: 200.966966966967"
## [4] "Mean of Palmer Penguins dataset: 4207.05705705706"

This code calculates the mean for each numeric column in the ‘penguins’ dataset.

Median

The median is calculated for the numeric columns in the cleaned ‘penguins’ dataset.

penguins_median <- sapply(penguins_clean[, 3:6], median, na.rm = TRUE)


print(paste("Median of Palmer Penguins dataset: ", penguins_median))

## [1] "Median of Palmer Penguins dataset: 44.5"


## [2] "Median of Palmer Penguins dataset: 17.3"
## [3] "Median of Palmer Penguins dataset: 197"
## [4] "Median of Palmer Penguins dataset: 4050"

This code computes the median for the numeric columns in the ‘penguins’ dataset.

Mode

The mode is calculated for the numeric columns in the ‘penguins’ dataset using the ‘calc_mode()’ function.

penguins_mode <- sapply(penguins_clean[, 3:6], calc_mode)


print(paste("Mode of Palmer Penguins dataset: ", penguins_mode))

## [1] "Mode of Palmer Penguins dataset: 41.1"


## [2] "Mode of Palmer Penguins dataset: 17"
## [3] "Mode of Palmer Penguins dataset: 190"
## [4] "Mode of Palmer Penguins dataset: 3800"

This code calculates the mode for the numeric columns in the ‘penguins’ dataset.

Variance

The variance for the numeric columns in the ‘penguins’ dataset is calculated.

penguins_variance <- sapply(penguins_clean[, 3:6], var, na.rm = TRUE)


print(paste("Variance of Palmer Penguins dataset: ", penguins_variance))

## [1] "Variance of Palmer Penguins dataset: 29.9063334418756"


## [2] "Variance of Palmer Penguins dataset: 3.87788830999674"
## [3] "Variance of Palmer Penguins dataset: 196.441676616375"
## [4] "Variance of Palmer Penguins dataset: 648372.487698542"

This code computes the variance for each numeric column in the ‘penguins’ dataset.

Standard Deviation

The standard deviation is calculated for each numeric column in the ‘penguins’ dataset.

penguins_sd <- sapply(penguins_clean[, 3:6], sd, na.rm = TRUE)


print(paste("Standard Deviation of Palmer Penguins dataset: ", penguins_sd))

## [1] "Standard Deviation of Palmer Penguins dataset: 5.46866834264756"


## [2] "Standard Deviation of Palmer Penguins dataset: 1.9692354633199"
## [3] "Standard Deviation of Palmer Penguins dataset: 14.0157652882879"
## [4] "Standard Deviation of Palmer Penguins dataset: 805.215801942897"

This code computes the standard deviation for each numeric column in the ‘penguins’ dataset.

Skewness

The skewness for each numeric column in the ‘penguins’ dataset is calculated.

penguins_skewness <- sapply(penguins_clean[, 3:6], skewness, na.rm = TRUE)


print(paste("Skewness of Palmer Penguins dataset: ", penguins_skewness))

## [1] "Skewness of Palmer Penguins dataset: 0.0451359779776739"


## [2] "Skewness of Palmer Penguins dataset: -0.149044996398334"
## [3] "Skewness of Palmer Penguins dataset: 0.358523654622741"
## [4] "Skewness of Palmer Penguins dataset: 0.470116171418382"

This code calculates the skewness for the numeric columns in the ‘penguins’ dataset.

Kurtosis

The kurtosis for each numeric column in the ‘penguins’ dataset is calculated.

penguins_kurtosis <- sapply(penguins_clean[, 3:6], kurtosis, na.rm = TRUE)


print(paste("Kurtosis of Palmer Penguins dataset: ", penguins_kurtosis))

## [1] "Kurtosis of Palmer Penguins dataset: -0.888173414588056"


## [2] "Kurtosis of Palmer Penguins dataset: -0.896587251127618"
## [3] "Kurtosis of Palmer Penguins dataset: -0.964832587409507"
## [4] "Kurtosis of Palmer Penguins dataset: -0.740485880259885"

This code calculates the kurtosis for the numeric columns in the penguins dataset.

8. Hypothesis Testing

t-test for Flipper Length between Adelie and Gentoo Species

A t-test is performed to compare the flipper length between the ‘Adelie’ and ‘Gentoo’ species in the ‘penguins’
dataset.

adelie <- subset(penguins_clean, species == "Adelie")$flipper_length_mm


gentoo <- subset(penguins_clean, species == "Gentoo")$flipper_length_mm
t_test_penguins <- t.test(adelie, gentoo)
print(t_test_penguins)

##
## Welch Two Sample t-test
##
## data: adelie and gentoo
## t = -33.506, df = 251.35, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.72740 -25.53771
## sample estimates:
## mean of x mean of y
## 190.1027 217.2353

This code compares the flipper length of Adelie and Gentoo species using a two-sample t-test.

9. Visualization of Palmer Penguins Dataset

Histogram of Flipper Length

A histogram is created to visualize the distribution of flipper length in the ‘penguins’ dataset.

ggplot(penguins_clean, aes(x = flipper_length_mm)) +


geom_histogram(binwidth = 3, fill = "green", color = "black") +
ggtitle("Histogram of Flipper Length in Palmer Penguins Dataset")

[Figure: Histogram of Flipper Length in the Palmer Penguins dataset (x = flipper_length_mm, y = count)]

This code creates a histogram to show the distribution of flipper length in the cleaned ‘penguins’ dataset.

Boxplot of Flipper Length by Species

A boxplot is created to show the variation in flipper length across different species in the ‘penguins’ dataset.

ggplot(penguins_clean, aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot() +
ggtitle("Boxplot of Flipper Length by Species in Palmer Penguins Dataset")

[Figure: Boxplot of Flipper Length by Species in the Palmer Penguins dataset (Adelie, Chinstrap, Gentoo)]

This code creates a boxplot to visualize the flipper length distribution across different species.

10. Conclusion

The analysis provides insights into the Iris and Palmer Penguins datasets through various statistical
metrics and visualizations. These results are useful for understanding the characteristics of the datasets and
performing further exploratory analysis.

Program - 4

Data Import, Cleaning, and Export with

Advanced Data Wrangling

DATE - 08/11/2024

1. Introduction

In this analysis, the steps of loading, cleaning, and analyzing two datasets, the Titanic dataset and the
Adult Income dataset, are followed. The process starts with importing the data, followed by cleaning steps
like handling missing values, removing outliers, and generating summary statistics. The cleaned datasets
are then used to explore correlations between variables.

2. Loading Required Libraries

The following libraries are used in the analysis:


‘tidyverse’ for data manipulation and visualization (e.g., ggplot2, dplyr), ‘titanic’ for accessing the Titanic
dataset, ‘dplyr’ for data manipulation tasks like filtering and summarizing, ‘caret’ for machine learning
tasks, including outlier detection, and ‘ggcorrplot’ for visualizing correlation matrices.

library(tidyverse)
library(titanic)
library(dplyr)
library(caret)
library(ggcorrplot)

3. Titanic Dataset

Import the Titanic Dataset

The Titanic dataset is imported using the ‘titanic_train’ dataset from the ‘titanic’ library.

data <- titanic::titanic_train

Handling Missing Values

Missing Age Values

Missing values in the ‘Age’ column are replaced with the median value.

data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)

This step replaces all missing values in the ‘Age’ column with the median of the non-missing values, ensuring
no gaps in the data.

Missing Embarked Values

Missing values in the ‘Embarked’ column are replaced with the mode (the most frequent value) of that
column.

mode_embarked <- as.character(names(sort(table(data$Embarked), decreasing = TRUE))[1])


data$Embarked[is.na(data$Embarked)] <- mode_embarked

Here, the most frequent value in the ‘Embarked’ column is determined and used to fill the missing values.

Removing Outliers Using Z-Scores

To remove outliers, the Z-scores are calculated for the numeric columns using the ‘scale()’ function. Any
rows with a Z-score greater than 3 or less than -3 are considered outliers and are removed.

numeric_columns <- sapply(data, is.numeric)


z_scores <- as.data.frame(scale(data[, numeric_columns]))
outlier_condition <- apply(z_scores, 1, function(row) any(abs(row) > 3))
data_clean <- data[!outlier_condition, ]

This code identifies rows with Z-scores outside the range of -3 to 3, which are treated as outliers and removed
from the dataset.
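
A simple sanity check (illustrative, not part of the original output) is to compare the row counts before and after removal:

cat("Rows removed as outliers:", nrow(data) - nrow(data_clean), "\n")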

Summarizing the Dataset

Before Cleaning

Summary statistics of the original Titanic dataset are displayed.

summary_before <- summary(titanic::titanic_train)

This step provides an overview of the original dataset, including the minimum, maximum, and median values
of each column.

After Cleaning

Summary statistics of the cleaned Titanic dataset are displayed.

summary_after <- summary(data_clean)

This step provides an overview of the cleaned dataset, helping to confirm that missing values have been
handled and outliers removed.

Correlation Matrix

The correlation matrix of numeric columns in the cleaned Titanic dataset is calculated. This matrix shows
the relationships between the numeric variables.

correlation_matrix <- cor(data_clean[, numeric_columns], use = "complete.obs")

The correlation matrix is used to identify any strong linear relationships between the numeric variables.

Exporting the Cleaned Data

The cleaned dataset is saved to a CSV file for future analysis or reporting.

write.csv(data_clean, "cleaned_titanic_data.csv", row.names = FALSE)

This code exports the cleaned dataset to a CSV file, making it available for future use.

Plotting the Correlation Matrix

The correlation matrix is visualized using the ‘ggcorrplot()’ function. The circles represent the correlation
coefficients, with the size of the circle indicating the strength of the correlation.

ggcorrplot(correlation_matrix, method = "circle", lab = TRUE,


title = "Correlation Matrix of Titanic Dataset")

[Figure: Correlation Matrix of Titanic Dataset, showing pairwise correlations among PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare]
This plot visually represents the correlation matrix, making it easier to identify relationships between the
variables.

4. Adult Income Dataset

Importing and Cleaning the Dataset

The Adult Income dataset is imported from a local file path, and missing values are handled. The dataset
contains demographic and economic information, with the goal of predicting whether a person earns more
or less than 50K per year.

data <- read.csv("adult.data", header = FALSE)

Assigning the Column Names

The column names for the Adult Income dataset are manually specified to better understand the data.

colnames(data) <- c('age', 'workclass', 'fnlwgt', 'education', 'education_num',


'marital_status', 'occupation', 'relationship', 'race',
'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
'native_country', 'income')

Handling the Missing Data

Replacing ‘?’ Values

Missing values represented as ‘?’ are replaced with NA for easier handling.

data[data == '?'] <- NA

Replacing Missing Categorical Values with the Mode

A custom function ‘replace_mode()’ replaces missing values in categorical columns with the mode of that
column.

replace_mode <- function(x) {


mode_value <- as.character(names(sort(table(x), decreasing = TRUE))[1])
x[is.na(x)] <- mode_value
return(x)
}
data <- data %>%
mutate_if(is.character, replace_mode)

The ‘replace_mode()’ function replaces missing values in the categorical columns with the most frequent
value (mode).

Replacing Missing Numeric Values with the Median

For numeric columns, missing values are replaced with the median value of that column.

data <- data %>%


mutate_if(is.numeric, ~ ifelse(is.na(.), median(., na.rm = TRUE), .))

This step ensures that missing values in numeric columns are replaced with the median of the respective
columns.

Remove Outliers from the Adult Dataset

Outliers in the Adult Income dataset are removed using Z-scores, similar to the Titanic dataset.

remove_outliers <- function(x) {
  # Helper kept for reference; the pipeline below applies the same z-score rule row-wise.
  z_scores <- scale(x)
  x[abs(z_scores) <= 3]
}

numeric_columns <- sapply(data, is.numeric)

data_clean <- data %>%


filter(!apply(as.data.frame(scale(data[, numeric_columns])), 1,
function(row) any(abs(row) > 3)))

This code detects outliers in the numeric columns and removes any rows with Z-scores outside the range of
-3 to 3.

Summarizing the Dataset

Before Cleaning

Summary statistics of the original Adult Income dataset are displayed.

summary_before <- summary(read.csv("adult.data", header = FALSE))

After Cleaning

Summary statistics of the cleaned Adult Income dataset are displayed.

summary_after <- summary(data_clean)

Calculating and Plotting the Correlation Matrix

The correlation matrix is calculated for the numeric columns in the cleaned Adult Income dataset and
visualized using ‘ggcorrplot()’.

correlation_matrix <- cor(data_clean[, numeric_columns], use = "complete.obs")


write.csv(data_clean, "cleaned_adult_income_data.csv", row.names = FALSE)
ggcorrplot(correlation_matrix, method = "circle", lab = TRUE,
title = "Correlation Matrix of Adult Income Dataset")

[Figure: Correlation Matrix of Adult Income Dataset, showing pairwise correlations among age, fnlwgt, education_num, capital_gain, capital_loss, and hours_per_week]

This step visualizes the correlations between the numeric columns and saves the cleaned dataset to a CSV
file for future analysis.

5. Conclusion

The analysis demonstrates how to clean and analyze the Titanic and Adult Income datasets. Missing values
are handled, outliers are removed, and the correlation between variables is visualized. The cleaned datasets
are now ready for further exploration or modeling.

Program - 5

Advanced Data Manipulation with dplyr and

Complex Grouping

DATE - 08/11/2024

1. Introduction

In this analysis, the Star Wars dataset and the Flights dataset from the nycflights13 package are
explored. Various operations such as filtering, grouping, and summarizing data are performed, along with
visualizations created using ggplot2. Techniques for joining data frames, calculating rolling averages, and
cumulative sums are also demonstrated.

2. Loading the Required Libraries

The following libraries are loaded for this analysis:

• dplyr for data manipulation.


• nycflights13 for accessing the datasets.
• ggplot2 for data visualization.
• zoo for calculating rolling averages.

library(dplyr)
library(nycflights13)
library(ggplot2)
library(zoo)

3. Star Wars Dataset

3.1 Load the Dataset

The starwars dataset contains information about characters in the Star Wars universe. The dataset is
loaded, and the head() function is used to preview the first few rows.

data("starwars")
head(starwars)

## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

3.2 Filter and Arrange the Data

This step involves:

• Selecting relevant columns: name, species, height, and mass.


• Filtering out rows with missing species or height values and keeping only characters with a height
greater than 100 cm.
• Arranging the data in descending order of height.

starwars_filtered <- starwars %>%


dplyr::select(name, species, height, mass) %>%
filter(!is.na(species) & !is.na(height) & height > 100) %>%
arrange(desc(height))

head(starwars_filtered)

## # A tibble: 6 x 4
## name species height mass
## <chr> <chr> <int> <dbl>
## 1 Yarael Poof Quermian 264 NA
## 2 Tarfful Wookiee 234 136
## 3 Lama Su Kaminoan 229 88
## 4 Chewbacca Wookiee 228 112
## 5 Roos Tarpals Gungan 224 82
## 6 Grievous Kaleesh 216 159

3.3 Visualize Height of Star Wars Characters

A bar plot is created to visualize the height of Star Wars characters:

• Characters are reordered by height in descending order.


• The fill aesthetic is used to color the bars by species.
• coord_flip() flips the coordinates to make the plot horizontal.

ggplot(starwars_filtered, aes(x = reorder(name, -height), y = height, fill = species)) +


geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Height of Star Wars Characters", x = "Character", y = "Height (cm)") +
theme_minimal()

[Figure: Height of Star Wars Characters, a horizontal bar chart of height (cm) per character, colored by species]

3.4 Grouping and Summarizing Data

Group by Species and Summarize Data

The data is grouped by species, and for each group, the following is calculated:

• Average height (avg_height).


• Average mass (avg_mass).
• Count of characters (count).

The results are arranged in descending order of count to show the species with the most characters.

species_summary <- starwars %>%


group_by(species) %>%
summarize(
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE),
count = n()
) %>%
arrange(desc(count))

head(species_summary)

## # A tibble: 6 x 4
## species avg_height avg_mass count
## <chr> <dbl> <dbl> <int>
## 1 Human 178 81.3 35
## 2 Droid 131. 69.8 6
## 3 <NA> 175 81 4

## 4 Gungan 209. 74 3
## 5 Kaminoan 221 88 2
## 6 Mirialan 168 53.1 2

Visualize Average Height by Species

A bar plot is created to show the average height for each species:

• Species are ordered by average height in descending order.


• Bars are colored according to the species.

ggplot(species_summary, aes(x = reorder(species, -avg_height), y = avg_height,


fill = species)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Average Height by Species", x = "Species", y = "Average Height (cm)") +
theme_minimal()
[Figure: Average Height by Species, a horizontal bar chart of average height (cm) for each species]

3.5 Classify Characters Based on Height

Create a New Column for Height Category

A new column, height_category, is created using mutate() to classify characters as “Tall” if their height
is greater than 180 cm, and “Short” otherwise.

starwars_classified <- starwars %>%


mutate(height_category = ifelse(height > 180, "Tall", "Short"))

head(starwars_classified)

## # A tibble: 6 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~
## # i 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, height_category <chr>

Visualize Height Categories

A bar plot is created to show the distribution of characters classified into “Tall” and “Short” categories.

ggplot(starwars_classified, aes(x = height_category, fill = height_category)) +


geom_bar() +
labs(title = "Distribution of Height Categories", x = "Height Category", y = "Count") +
theme_minimal()

[Figure: Distribution of Height Categories, a bar chart of character counts in the Short, Tall, and NA categories]

4. Flights and Airlines Datasets

4.1 Data Import

Inner Join Flights and Airlines

The flights and airlines datasets are joined on the carrier column using an inner join. This keeps only
rows where there is a matching carrier in both datasets.

data("flights")
data("airlines")

flights_inner_join <- flights %>%


inner_join(airlines, by = "carrier")

head(flights_inner_join)

## # A tibble: 6 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # i 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>

Full Outer Join Flights and Airlines

A full outer join is performed to merge all rows from both datasets, filling in missing values where no match
is found.

flights_outer_join <- flights %>%


full_join(airlines, by = "carrier")

head(flights_outer_join)

## # A tibble: 6 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # i 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>

4.2 Calculate Rolling Averages and Cumulative Sums

Calculate Rolling Average and Cumulative Delay

The flights dataset is sorted by year, month, and day. A 5-period rolling average of arr_delay is
calculated using zoo::rollmean(), and the cumulative sum of arr_delay is calculated using cumsum().

flights_rolling <- flights %>%


arrange(year, month, day) %>%
mutate(
rolling_avg_delay = zoo::rollmean(arr_delay, 5, fill = NA),
cumulative_delay = cumsum(ifelse(is.na(arr_delay), 0, arr_delay))
)

Handle Missing Values and Recalculate

Missing values in arr_delay are replaced with 0 before recalculating the rolling average and cumulative
delay.

flights_rolling <- flights %>%


arrange(year, month, day) %>%
mutate(
arr_delay = ifelse(is.na(arr_delay), 0, arr_delay),
rolling_avg_delay = rollmean(arr_delay, 5, fill = NA),
cumulative_delay = cumsum(arr_delay)
)

Visualize Rolling Average and Cumulative Delay

This plot visualizes the rolling average delay and cumulative delay for flights:

• The rolling average delay is shown in blue.


• The cumulative delay (scaled by 1000) is shown in red.

ggplot(flights_rolling, aes(x = day)) +


geom_line(aes(y = rolling_avg_delay, color = "Rolling Average Delay")) +
geom_line(aes(y = cumulative_delay / 1000, color = "Cumulative Delay (x1000)")) +
labs(title = "Rolling Average and Cumulative Delay of Flights", x = "Day of the Month",
y = "Delay (minutes)") +
scale_color_manual(values = c("Rolling Average Delay" = "blue",
"Cumulative Delay (x1000)" = "red")) +
theme_minimal()

[Figure: Rolling Average and Cumulative Delay of Flights, showing delay (minutes) by day of the month; the rolling average delay is in blue and the cumulative delay (x1000) in red]

5. Conclusion

This analysis explores the starwars and flights datasets. Data manipulation tasks such as filtering,
grouping, and summarizing data are performed, and ggplot2 is used to create meaningful visualizations.
Techniques for joining datasets and calculating rolling averages and cumulative sums are also demonstrated.

Program - 6

Data Visualization with ggplot2 and

Customizations

DATE - 15/11/2024

1. Introduction

This analysis focuses on creating advanced visualizations using the ggplot2 package in R. The datasets
include mpg and diamonds, and the visualizations showcase scatter plots, faceted plots, heatmaps, and
annotated graphs. Customizations such as themes, annotations, and color palettes enhance the visual
appeal and interpretability.

2. Loading Required Libraries

The following libraries are loaded for this analysis: - ggplot2: For data visualization. - reshape2: For
reshaping data to create heatmaps. - dplyr: For data manipulation.

library(ggplot2)
library(reshape2)
library(dplyr)

3. Scatter Plot with Regression Line

This section creates a scatter plot to analyze the relationship between engine displacement (displ) and
highway miles per gallon (hwy). A regression line with confidence intervals is added for further insights.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +


geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, linetype = "dashed", color = "black", size = 1) +
labs(
title = "Scatter Plot of Engine Displacement vs Highway MPG with Regression Line",
x = "Engine Displacement (L)",
y = "Highway Miles per Gallon",
color = "Vehicle Class"
) +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "bottom"
)

[Figure: Scatter Plot of Engine Displacement vs Highway MPG with Regression Line, colored by vehicle class]

The scatter plot includes: - Points representing individual observations. - A regression line (dashed) showing
the trend. - Confidence intervals around the regression line.

4. Faceted Scatter Plot

Faceted scatter plots allow data to be visualized separately by vehicle class. This approach highlights
class-specific patterns.

ggplot(mpg, aes(x = displ, y = hwy)) +


geom_point(color = "darkgreen", size = 2) +
facet_wrap(~ class, ncol = 3) +
labs(
title = "Faceted Scatter Plot by Vehicle Class",
x = "Engine Displacement (L)",
y = "Highway Miles per Gallon"
) +
theme_minimal() +
theme(
strip.text = element_text(size = 12, face = "italic"),
plot.title = element_text(hjust = 0.5, size = 16)
)

[Figure: Faceted Scatter Plot by Vehicle Class, engine displacement (L) vs highway miles per gallon, one panel per class]

Each facet represents a vehicle class, making it easier to compare relationships within and across classes.

5. Heatmap of Correlation Matrix

This section visualizes correlations among numeric variables in the diamonds dataset using a heatmap.

data("diamonds")
cor_matrix <- cor(diamonds[, sapply(diamonds, is.numeric)], use = "complete.obs")
cor_data <- melt(cor_matrix)

ggplot(cor_data, aes(Var1, Var2, fill = value)) +


geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab", name = "Correlation") +
labs(
title = "Heatmap of Correlation Matrix for Diamonds Dataset",
x = "Variables",
y = "Variables"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(hjust = 0.5, size = 16)
)

[Figure: Heatmap of Correlation Matrix for Diamonds Dataset, showing correlations among carat, depth, table, price, x, y, and z]

The heatmap shows: - Strong correlations in red. - Weak correlations in white. - Negative correlations in
blue.

6. Customized Scatter Plot

This section demonstrates a scatter plot with enhanced aesthetics, including a specific color palette and
additional customization.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +


geom_point(size = 3, shape = 21, fill = "lightblue", alpha = 0.8) +
theme_light() +
scale_color_brewer(palette = "Set2") +
labs(
title = "Customized Scatter Plot with Aesthetic Enhancements",
x = "Engine Displacement (L)",
y = "Highway Miles per Gallon",
color = "Class"
) +
theme(
plot.title = element_text(face = "bold", size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.background = element_rect(fill = "gray90")
)

[Figure: Customized Scatter Plot with Aesthetic Enhancements, engine displacement (L) vs highway miles per gallon, colored by class]

The customized scatter plot uses: - Brewer palette for color differentiation. - Improved readability with
light theme and bold title.

7. Annotated Scatter Plot

Annotations are added to highlight specific regions in the data.

annotated_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +


geom_point(color = "purple", size = 3) +
annotate("text", x = 4, y = 40, label = "High Efficiency Zone", color = "red",
size = 5, fontface = "bold", angle = 15) +
annotate("rect", xmin = 2, xmax = 4, ymin = 30, ymax = 45, alpha = 0.2,
fill = "yellow", color = "orange") +
labs(
title = "Annotated Scatter Plot with Highlighted Zone",
x = "Engine Displacement (L)",
y = "Highway Miles per Gallon"
) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))

ggsave("annotated_scatter_plot_expanded.png", annotated_plot, width = 10,


height = 8, dpi = 300)

The plot highlights a specific area with a rectangle and adds a label for context. The ggsave function saves
the plot as a PNG file.

8. Conclusion

This analysis showcases advanced visualization techniques using ggplot2. Techniques include scatter plots,
faceted plots, heatmaps, and annotated graphs. Customizations improve interpretability and presentation
quality.

Program - 7

Linear and Multiple Regression Analysis with

Interaction Terms

DATE - 22/11/2024

1. Introduction

This analysis uses the Boston dataset to explore the relationships between housing prices and various
predictor variables. The primary goal is to build and evaluate regression models, including simple and
multiple linear regression, to understand the factors influencing median home values (medv). Additionally,
a classification approach is implemented to assess housing value categories using logistic regression.

2. Load Libraries and Data

library(MASS)
library(ggplot2)
library(caret)
library(car)
library(pROC)
library(dplyr)
library(corrplot)
data("Boston")

The required libraries are loaded to enable data manipulation, visualization, and model building. The
Boston dataset is used for regression analysis, containing information about housing values in Boston.

3. Preprocessing

sum(is.na(Boston))

## [1] 0

summary(Boston)

## crim zn indus chas


## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795

## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00

boxplot(Boston$medv, main = "Boxplot of Median Value of Homes (medv)")

[Figure: Boxplot of Median Value of Homes (medv), with values ranging from roughly 5 to 50.]

Boston <- Boston %>% filter(medv < 50)

This section checks for missing values and provides a statistical summary of the data. A boxplot visualizes
the target variable medv, and potential outliers (homes with medv equal to 50) are removed to improve model
accuracy.

4. Feature Selection

cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle")

[Figure: Correlation matrix of the Boston variables rendered with corrplot (circle method); the color scale runs from −1 (negative correlation) to 1 (positive correlation).]

The correlation matrix between numerical features is calculated and visualized using corrplot. Features
such as lstat (lower status of the population) and rm (average number of rooms per dwelling) are identified
as strong predictors of medv based on their correlations.
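
As a quick check of this ranking, the correlations with medv can be pulled out of the matrix directly. This is a minimal sketch, not part of the original program, reusing the cor_matrix object computed above:

cor_with_medv <- cor_matrix[, "medv"]
# Rank predictors by the absolute strength of their correlation with medv
sort(abs(cor_with_medv[names(cor_with_medv) != "medv"]), decreasing = TRUE)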

5. Simple Linear Regression

simple_model <- lm(medv ~ lstat, data = Boston)


summary(simple_model)

##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.992 -3.313 -0.941 1.914 21.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.54041 0.48150 67.58 <2e-16 ***
## lstat -0.84374 0.03268 -25.82 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##

## Residual standard error: 5.119 on 488 degrees of freedom
## Multiple R-squared: 0.5774, Adjusted R-squared: 0.5765
## F-statistic: 666.6 on 1 and 488 DF, p-value: < 2.2e-16

A simple linear regression model is fitted with lstat as the predictor for medv. The coefficients and p-values
are interpreted to assess the significance and relationship between the variables.
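
To make the fitted relationship visible, the regression line can be overlaid on the raw data. A minimal sketch using the already loaded ggplot2 package (geom_smooth with method = "lm" refits the same simple model purely for display):

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Fitted Simple Regression: medv ~ lstat",
       x = "lstat", y = "medv")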

6. Multiple Linear Regression with Interaction

multiple_model <- lm(medv ~ lstat * rm, data = Boston)


summary(multiple_model)

##
## Call:
## lm(formula = medv ~ lstat * rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3064 -2.4982 -0.3056 1.8635 18.4779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.99970 3.07045 -8.468 2.98e-16 ***
## lstat 1.97178 0.17761 11.102 < 2e-16 ***
## rm 9.01216 0.46519 19.373 < 2e-16 ***
## lstat:rm -0.43817 0.02976 -14.723 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.845 on 486 degrees of freedom
## Multiple R-squared: 0.7625, Adjusted R-squared: 0.761
## F-statistic: 520 on 3 and 486 DF, p-value: < 2.2e-16

A multiple linear regression model is fitted using lstat, rm, and their interaction term. The adjusted R-
squared is analyzed to measure the model’s explanatory power. Interaction terms help understand how rm
modifies the effect of lstat on medv.
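
Because of the interaction, the slope of lstat is not a single number; it depends on rm. A small sketch (assuming the coefficient names produced by lm above) that evaluates the implied slope of lstat at a few values of rm:

coefs <- coef(multiple_model)
rm_values <- c(5, 6, 7, 8)
# Implied effect of a one-unit increase in lstat at each rm value
data.frame(rm = rm_values,
           lstat_slope = coefs["lstat"] + coefs["lstat:rm"] * rm_values)

With the estimates reported above, the implied slope is close to zero for small rm and becomes increasingly negative as rm grows, which is exactly the behaviour the interaction term captures.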

7. Model Performance Evaluation

adjusted_R2 <- summary(multiple_model)$adj.r.squared


AIC_value <- AIC(multiple_model)
BIC_value <- BIC(multiple_model)

cat("Adjusted Rˆ2:", adjusted_R2, "\n")

## Adjusted R^2: 0.7609872

cat("AIC:", AIC_value, "\n")

## AIC: 2716.448

cat("BIC:", BIC_value, "\n")

## BIC: 2737.42

Model performance metrics such as adjusted R-squared, AIC, and BIC are calculated and printed. These
metrics help compare models by evaluating their goodness of fit and complexity.
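
The same metrics can be laid side by side for the two models fitted so far, which makes the improvement from the interaction model explicit. A short sketch reusing simple_model and multiple_model:

data.frame(
  model  = c("medv ~ lstat", "medv ~ lstat * rm"),
  adj_R2 = c(summary(simple_model)$adj.r.squared,
             summary(multiple_model)$adj.r.squared),
  AIC    = c(AIC(simple_model), AIC(multiple_model)),
  BIC    = c(BIC(simple_model), BIC(multiple_model))
)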

8. Model Diagnostics: Residual Analysis

plot(multiple_model, which = 1, main = "Residuals vs Fitted Plot")

[Figure: Residuals vs Fitted plot for lm(medv ~ lstat * rm); observations 178, 354, and 355 are flagged as the most extreme residuals.]

plot(multiple_model, which = 2, main = "Normal Q-Q Plot")

[Figure: Normal Q-Q plot of standardized residuals for lm(medv ~ lstat * rm); observations 178, 354, and 355 deviate most from the reference line.]

Residual diagnostics are performed to check assumptions of linear regression. The residuals vs. fitted plot
evaluates homoscedasticity and linearity, while the Q-Q plot assesses the normality of residuals.
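
The visual checks can be complemented by formal tests. A sketch using functions from packages already loaded in this program (shapiro.test from base R, ncvTest from car); small p-values would point to non-normal residuals or non-constant variance, respectively:

shapiro.test(residuals(multiple_model))  # normality of residuals
ncvTest(multiple_model)                  # score test for non-constant error variance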

9. Cross-Validation for Model Accuracy

set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(medv ~ lstat * rm, data = Boston, method = "lm",
trControl = train_control)
print(cv_model)

## Linear Regression
##
## 490 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 441, 441, 442, 441, 441, 440, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.855714 0.7597212 2.868657
##
## Tuning parameter ’intercept’ was held constant at a value of TRUE

Ten-fold cross-validation is used to evaluate the predictive performance of the multiple linear regression
model. The cross-validated RMSE provides an estimate of the model’s accuracy on unseen data.
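
caret also stores the per-fold results, which show how stable the error estimate is across folds. A minimal sketch reading them from the trained object:

# Per-fold RMSE, R-squared, and MAE for the final model
cv_model$resample
summary(cv_model$resample$RMSE)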

10. ROC Curve Analysis (Classification Approach)

Boston$medv_class <- ifelse(Boston$medv >= 25, 1, 0)


logistic_model <- glm(medv_class ~ lstat * rm, data = Boston, family = "binomial")
summary(logistic_model)

##
## Call:
## glm(formula = medv_class ~ lstat * rm, family = "binomial", data = Boston)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.80238 7.45022 -2.658 0.00786 **
## lstat -0.02301 0.75334 -0.031 0.97563
## rm 3.29921 1.12727 2.927 0.00343 **
## lstat:rm -0.03989 0.11492 -0.347 0.72851
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 536.34 on 489 degrees of freedom
## Residual deviance: 254.70 on 486 degrees of freedom
## AIC: 262.7
##
## Number of Fisher Scoring iterations: 8

pred_probs <- predict(logistic_model, type = "response")


roc_curve <- roc(Boston$medv_class, pred_probs)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve, main = "ROC Curve for Logistic Regression Model", col = "blue")
abline(a = 0, b = 1, lty = 2, col = "red")

[Figure: ROC curve for the logistic regression model (sensitivity vs specificity), with the diagonal reference line shown in red.]

cat("AUC:", auc(roc_curve), "\n")

## AUC: 0.9392864

The regression problem is converted into a binary classification task by creating a new variable medv_class,
where values of medv greater than or equal to 25 are classified as 1 and others as 0. A logistic regression model
is fitted with interaction terms. The ROC curve and AUC are used to evaluate the model’s classification
performance.
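
The AUC summarizes ranking quality; a concrete operating point can be inspected by thresholding the predicted probabilities. A sketch using a 0.5 cut-off (any other threshold could be read off the ROC curve instead):

pred_class <- ifelse(pred_probs >= 0.5, 1, 0)
# Confusion matrix of predicted vs actual high-value homes
table(Predicted = pred_class, Actual = Boston$medv_class)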

11. Conclusion

The analysis demonstrates that a multiple linear regression model incorporating interaction terms between
lstat and rm provides a better fit compared to a simple linear regression model. The model’s performance
is assessed through adjusted R-squared, AIC, and BIC values, and its predictive accuracy is evaluated using
cross-validation. The assumptions of linear regression are reasonably satisfied based on residual diagnostics.
Additionally, a logistic regression model effectively classifies homes based on their median value, with the
ROC curve indicating satisfactory discriminatory power.

Program - 8

K-Means Clustering and PCA for

Dimensionality Reduction

DATE - 22/11/2024

1. Introduction

The analysis involves performing clustering on two datasets: the Wine dataset and the Breast Cancer
dataset. Principal Component Analysis (PCA) is used for dimensionality reduction, followed by k-means
clustering for group identification. Methods such as the elbow method and silhouette analysis are employed
to determine the optimal number of clusters. The objective is to explore patterns in the datasets and
visualize clustering results.

2. Wine Dataset Analysis

2.1 Load Libraries and Data

library(ggplot2)
library(cluster)
library(factoextra)

normalize <- function(data) {
  return((data - min(data)) / (max(data) - min(data)))
}

# The wine data frame is assumed to be already available in the workspace
# (for example via data(wine, package = "rattle.data")); its first column is the class label
wine <- wine

wine_data <- wine[, -1]
wine_norm <- as.data.frame(lapply(wine_data, normalize))

The required libraries are loaded, and the Wine dataset is prepared by normalizing its features to scale
values between 0 and 1.

2.2 Principal Component Analysis (PCA)

wine_pca <- prcomp(wine_norm, scale. = TRUE)


summary(wine_pca)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000

wine_pca_data <- as.data.frame(wine_pca$x[, 1:2])

PCA is applied to the normalized dataset, and the first two principal components are extracted for clustering.
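
Before clustering on only two components, it is worth checking how much variance they retain (about 55% according to the summary above). A sketch using fviz_eig from the already loaded factoextra package:

# Scree plot of variance explained per principal component
fviz_eig(wine_pca, addlabels = TRUE)
# Cumulative proportion retained by the first two components
cumsum(wine_pca$sdev^2 / sum(wine_pca$sdev^2))[1:2]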

2.3 Determine Optimal Number of Clusters

elbow_wine <- fviz_nbclust(wine_pca_data, kmeans, method = "wss")


print(elbow_wine)

[Figure: Elbow plot — total within-cluster sum of squares vs number of clusters k = 1–10 for the Wine PCA scores.]

silhouette_wine <- fviz_nbclust(wine_pca_data, kmeans, method = "silhouette")


print(silhouette_wine)

[Figure: Average silhouette width vs number of clusters k = 1–10 for the Wine PCA scores.]

The elbow method and silhouette analysis are utilized to identify the optimal number of clusters for k-means
clustering.

2.4 K-Means Clustering

set.seed(123)
wine_kmeans <- kmeans(wine_pca_data, centers = 3, nstart = 25)

wine_pca_data$cluster <- as.factor(wine_kmeans$cluster)

p1 <- ggplot(wine_pca_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering on Wine Dataset")
print(p1)

[Figure: K-Means Clustering on Wine Dataset — PC1 vs PC2, points colored by cluster (1–3).]

cat("Wine Dataset Clustering Results:\n")

## Wine Dataset Clustering Results:

cat("Cluster Sizes: ", wine_kmeans$size, "\n")

## Cluster Sizes: 64 65 49

K-means clustering is performed with the optimal number of clusters, and the results are visualized using a
scatter plot. Cluster sizes are reported.
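
Since the Wine data ships with a class label (the first column dropped before normalization), the clusters can be cross-tabulated against it to see how well they recover the known groups. A sketch, assuming that first column holds the cultivar/type:

# Rows: original class label, columns: assigned k-means cluster
table(Class = wine[, 1], Cluster = wine_kmeans$cluster)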

3. Breast Cancer Dataset Analysis

3.1 Load and Normalize Data

bc_data <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
  header = FALSE)
bc_features <- bc_data[, -c(1, 2)]
bc_norm <- as.data.frame(lapply(bc_features, normalize))

The Breast Cancer dataset is loaded, and features are normalized to scale values between 0 and 1.

3.2 Principal Component Analysis (PCA)

bc_pca <- prcomp(bc_norm, scale. = TRUE)
summary(bc_pca)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion 0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion 0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
## PC29 PC30
## Standard deviation 0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000

bc_pca_data <- as.data.frame(bc_pca$x[, 1:2])

PCA is applied to the normalized dataset, and the first two principal components are extracted for clustering.

3.3 Determine Optimal Number of Clusters

elbow_bc <- fviz_nbclust(bc_pca_data, kmeans, method = "wss")


print(elbow_bc)

[Figure: Elbow plot — total within-cluster sum of squares vs number of clusters k = 1–10 for the Breast Cancer PCA scores.]

silhouette_bc <- fviz_nbclust(bc_pca_data, kmeans, method = "silhouette")


print(silhouette_bc)

[Figure: Average silhouette width vs number of clusters k = 1–10 for the Breast Cancer PCA scores.]

The elbow method and silhouette analysis are used to determine the optimal number of clusters for k-means clustering.

3.4 K-Means Clustering

set.seed(123)
bc_kmeans <- kmeans(bc_pca_data, centers = 2, nstart = 25)

bc_pca_data$cluster <- as.factor(bc_kmeans$cluster)

p2 <- ggplot(bc_pca_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering on Breast Cancer Dataset")
print(p2)

[Figure: K-Means Clustering on Breast Cancer Dataset — PC1 vs PC2, points colored by cluster (1–2).]

cat("Breast Cancer Dataset Clustering Results:\n")

## Breast Cancer Dataset Clustering Results:

cat("Cluster Sizes: ", bc_kmeans$size, "\n")

## Cluster Sizes: 191 378

K-means clustering is performed with the optimal number of clusters. Results are visualized using a scatter
plot, and cluster sizes are reported.
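
In the wdbc.data file the second column holds the diagnosis label ("M" for malignant, "B" for benign), so the two clusters can be compared against it. A sketch of that cross-tabulation:

# Rows: diagnosis from the source file, columns: assigned k-means cluster
table(Diagnosis = bc_data[, 2], Cluster = bc_kmeans$cluster)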

4. Conclusion

The analysis successfully demonstrated the use of PCA for dimensionality reduction and k-means clustering
for grouping data points. For the Wine dataset, three clusters were identified, while the Breast Cancer
dataset exhibited two clusters. The elbow and silhouette methods guided the determination of the optimal
cluster numbers, and visualizations provided insights into the clustering patterns. These techniques showcase
how dimensionality reduction and clustering can reveal structure in complex datasets.

Program - 9

Time Series Analysis using ARIMA and

Seasonal Decomposition

DATE - 20/12/2024

1. Introduction

This document demonstrates how to perform time series analysis using two datasets: AirPassengers
(monthly international airline passenger counts from 1949 to 1960) and Monthly Milk Production (milk
production per cow from 1994 to 2005). The analysis focuses on the following key aspects:

1. Exploratory Data Analysis (EDA): Understanding the structure and properties of the datasets.

2. Decomposition: Breaking down the series into its components—trend, seasonality, and residuals.

3. Model Fitting: Building ARIMA and SARIMA models to predict future values.

4. Forecast Evaluation: Comparing the accuracy of the models’ forecasts.

5. Visualization: Highlighting the results through graphs to aid in interpretation.

The goal is to compare ARIMA and SARIMA models, assess their performance, and understand their
suitability for time series forecasting.

2. Loading Libraries

This section imports essential libraries for time series modeling and visualization.

• forecast: Provides tools for ARIMA/SARIMA modeling and forecasting.

• ggplot2: Used for creating advanced visualizations.

• TSA: Offers statistical tools specific to time series analysis.

• tseries: Includes functions for hypothesis testing, such as the Augmented Dickey-Fuller (ADF) test.

# Load required libraries for time series analysis and modeling


library(forecast)
library(ggplot2)
library(TSA)
library(tseries)

3. Defining Utility Functions

Utility functions simplify repetitive tasks such as analyzing the data, decomposing time series, building
models, and comparing forecasts.

3.1 Exploratory Data Analysis (EDA)

This function performs a detailed EDA, including statistical summaries and visualizations. Key components:

• Summary Statistics: Provides measures like mean, median, and range, giving a quick overview of
the dataset’s properties.

• Plots: Visualizes the time series to observe trends, seasonality, or irregularities.

• ACF and PACF Plots: Help identify dependencies in the series, aiding in ARIMA model parameter
selection.

# Function to perform Exploratory Data Analysis (EDA) on the time series data
perform_eda <- function(ts_data, dataset_name) {
cat("Exploratory Data Analysis for ", dataset_name, "\n")
print(summary(ts_data))
plot(ts_data, main = paste(dataset_name, "Time Series"), ylab = "Values", xlab = "Time")
cat("ACF and PACF plots:\n")
acf(ts_data, main = paste("ACF of", dataset_name))
pacf(ts_data, main = paste("PACF of", dataset_name))
}

3.2 Decomposing Time Series

Decomposition is crucial for understanding how different components contribute to the series.

• Trend: Long-term increase or decrease in the data.

• Seasonality: Regular, repeating patterns within a fixed period.

• Residuals: Random noise or unexplained variations.

This function splits the time series into these components and visualizes them for further analysis.

# Function to decompose the time series into trend, seasonal, and residual components
decompose_ts <- function(ts_data, dataset_name) {
cat("Decomposing the time series for ", dataset_name, "\n")
decomposition <- decompose(ts_data)
plot(decomposition)
return(decomposition)
}

3.3 Fitting ARIMA Model

The ARIMA model is used for non-seasonal data. This function:

1. Tests for stationarity using the ADF test.

2. Differentiates non-stationary data to make it stationary.

3. Automatically selects ARIMA parameters (p, d, q) using auto.arima.

4. Fits the ARIMA model and generates a forecast.

# Function to fit an ARIMA model to the time series data
fit_arima <- function(ts_data, dataset_name) {
cat("Fitting ARIMA model for ", dataset_name, "\n")
adf_test <- adf.test(ts_data, alternative = "stationary")
cat("ADF Test p-value:", adf_test$p.value, "\n")
if (adf_test$p.value > 0.05) {
ts_data <- diff(ts_data)
plot(ts_data, main = paste(dataset_name, "Differenced Time Series"))
}
auto_model <- auto.arima(ts_data, seasonal = FALSE)
print(summary(auto_model))
forecast_result <- forecast(auto_model, h = 12)
plot(forecast_result, main = paste(dataset_name, "ARIMA Forecast"))
return(auto_model)
}

3.4 Fitting SARIMA Model

SARIMA extends ARIMA by incorporating seasonality. It uses seasonal differencing to handle periodic
patterns.
This function automatically selects SARIMA parameters (P, D, Q) and fits the model to the data.

# Function to fit a Seasonal ARIMA (SARIMA) model to the time series data
fit_sarima <- function(ts_data, dataset_name) {
cat("Fitting SARIMA model for ", dataset_name, "\n")
auto_sarima <- auto.arima(ts_data, seasonal = TRUE)
print(summary(auto_sarima))
sarima_forecast <- forecast(auto_sarima, h = 12)
plot(sarima_forecast, main = paste(dataset_name, "SARIMA Forecast"))
return(auto_sarima)
}

3.5 Comparing Models

This function calculates and compares the accuracy of ARIMA and SARIMA forecasts using metrics like
Root Mean Squared Error (RMSE).

# Function to compare ARIMA and SARIMA models by evaluating forecast accuracy


compare_models <- function(arima_model, sarima_model, ts_data) {
cat("Comparing ARIMA and SARIMA models:\n")
h <- min(12, length(ts_data))
arima_forecast <- forecast(arima_model, h = h)
sarima_forecast <- forecast(sarima_model, h = h)
actual_values <- ts_data[(length(ts_data) - h + 1):length(ts_data)]
arima_accuracy <- accuracy(arima_forecast$mean, actual_values)
sarima_accuracy <- accuracy(sarima_forecast$mean, actual_values)
cat("ARIMA Forecast Accuracy:\n", arima_accuracy)
cat("SARIMA Forecast Accuracy:\n", sarima_accuracy)
}

3.6 Plotting Forecast Comparison

These functions create a visualization comparing actual vs. forecasted values for the ARIMA and SARIMA models.
The model with the lower RMSE is drawn in green and the other in red; the two variants below differ only in where the legend is placed (top-right vs. top-left).

# Function to visualize the comparison of ARIMA and SARIMA forecast performance
plot_forecast_comparison <- function(actual_values, arima_forecast, sarima_forecast,
time_points) {
arima_rmse <- sqrt(mean((arima_forecast - actual_values)ˆ2))
sarima_rmse <- sqrt(mean((sarima_forecast - actual_values)ˆ2))
better_color <- ifelse(arima_rmse < sarima_rmse, "green", "red")
worse_color <- ifelse(arima_rmse < sarima_rmse, "red", "green")
plot(time_points, actual_values, type = "o", col = "blue", pch = 16, lty = 1,
xlab = "Time", ylab = "Values", main = "Forecast Comparison")
lines(time_points, arima_forecast, col = better_color, lty = 2, lwd = 2)
lines(time_points, sarima_forecast, col = worse_color, lty = 3, lwd = 2)
legend("topright", legend = c("Actual Values", paste("ARIMA (RMSE =",
round(arima_rmse, 2), ")"), paste("SARIMA (RMSE =",
round(sarima_rmse, 2), ")")), col = c("blue", better_color,
worse_color), lty = c(1, 2, 3), lwd = c(1, 2, 2),
pch = c(16, NA, NA))
}

plot_forecast_comparison1 <- function(actual_values, arima_forecast, sarima_forecast,
                                      time_points) {
arima_rmse <- sqrt(mean((arima_forecast - actual_values)ˆ2))
sarima_rmse <- sqrt(mean((sarima_forecast - actual_values)ˆ2))
better_color <- ifelse(arima_rmse < sarima_rmse, "green", "red")
worse_color <- ifelse(arima_rmse < sarima_rmse, "red", "green")
plot(time_points, actual_values, type = "o", col = "blue", pch = 16, lty = 1,
xlab = "Time", ylab = "Values", main = "Forecast Comparison")
lines(time_points, arima_forecast, col = better_color, lty = 2, lwd = 2)
lines(time_points, sarima_forecast, col = worse_color, lty = 3, lwd = 2)
legend("topleft", legend = c("Actual Values", paste("ARIMA (RMSE =",
round(arima_rmse, 2), ")"), paste("SARIMA (RMSE =",
round(sarima_rmse, 2), ")")), col = c("blue", better_color,
worse_color), lty = c(1, 2, 3), lwd = c(1, 2, 2),
pch = c(16, NA, NA))
}

4. Dataset Analysis

4.1 AirPassengers Dataset

The AirPassengers dataset records monthly international airline passenger numbers from 1949 to 1960.
Below is the analysis conducted on this dataset:

# AirPassengers Dataset Analysis


data("AirPassengers")
air_data <- AirPassengers
perform_eda(air_data, "AirPassengers")

## Exploratory Data Analysis for AirPassengers


## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0

[Figure: AirPassengers time series — monthly passenger counts (Values vs Time, 1949–1960).]

## ACF and PACF plots:

[Figures: ACF and PACF of AirPassengers (autocorrelation and partial autocorrelation vs lag).]

decompose_ts(air_data, "AirPassengers")

## Decomposing the time series for AirPassengers

[Figure: Decomposition of additive time series — observed, trend, seasonal, and random components of AirPassengers.]

## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1949 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1950 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1951 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1952 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1953 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1954 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1955 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1956 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1957 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1958 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1959 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1960 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## Jul Aug Sep Oct Nov Dec
## 1949 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1950 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1951 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1952 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1953 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1954 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1955 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1956 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1957 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1958 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1959 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1960 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
##
## $trend
## Jan Feb Mar Apr May Jun Jul Aug
## 1949 NA NA NA NA NA NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
## 1952 183.1250 186.2083 189.0417 191.2917 193.5833 195.8333 198.0417 199.7500
## 1953 215.8333 218.5000 220.9167 222.9167 224.0833 224.7083 225.3333 225.3333
## 1954 228.0000 230.4583 232.2500 233.9167 235.6250 237.7500 240.5000 243.9583
## 1955 261.8333 266.6667 271.1250 275.2083 278.5000 281.9583 285.7500 289.3333
## 1956 309.9583 314.4167 318.6250 321.7500 324.5000 327.0833 329.5417 331.8333
## 1957 348.2500 353.0000 357.6250 361.3750 364.5000 367.1667 369.4583 371.2083
## 1958 375.2500 377.9167 379.5000 380.0000 380.7083 380.9583 381.8333 383.6667
## 1959 402.5417 407.1667 411.8750 416.3333 420.5000 425.5000 430.7083 435.1250
## 1960 456.3333 461.3750 465.2083 469.3333 472.7500 475.0417 NA NA
## Sep Oct Nov Dec
## 1949 127.9583 128.5833 129.0000 129.7500

## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
## 1952 202.2083 206.2500 210.4167 213.3750
## 1953 224.9583 224.5833 224.4583 225.5417
## 1954 247.1667 250.2500 253.5000 257.1250
## 1955 293.2500 297.1667 301.0000 305.4583
## 1956 334.4583 337.5417 340.5417 344.0833
## 1957 372.1667 372.4167 372.7500 373.6250
## 1958 386.5000 390.3333 394.7083 398.6250
## 1959 437.7083 440.9583 445.8333 450.6250
## 1960 NA NA NA NA
##
## $random
## Jan Feb Mar Apr May Jun
## 1949 NA NA NA NA NA NA
## 1950 8.4987374 29.1047980 8.3244949 6.6199495 -7.9103535 -25.1527778
## 1951 12.6237374 26.6464646 18.4078283 6.9116162 9.8396465 -26.4861111
## 1952 12.6237374 29.9797980 6.1994949 -2.2550505 -6.0770202 -13.2361111
## 1953 4.9154040 13.6881313 17.3244949 20.1199495 9.4229798 -17.1111111
## 1954 0.7487374 -6.2702020 4.9911616 1.1199495 2.8813131 -9.1527778
## 1955 4.9154040 2.5214646 -1.8838384 1.8282828 -3.9936869 -2.3611111
## 1956 -1.2095960 -1.2285354 0.6161616 -0.7133838 -1.9936869 11.5138889
## 1957 -8.5012626 -15.8118687 0.6161616 -5.3383838 -4.9936869 19.4305556
## 1958 -10.5012626 -23.7285354 -15.2588384 -23.9633838 -13.2020202 18.6388889
## 1959 -17.7929293 -28.9785354 -3.6338384 -12.2967172 4.0063131 11.0972222
## 1960 -14.5845960 -34.1868687 -43.9671717 -0.2967172 3.7563131 24.5555556
## Jul Aug Sep Oct Nov Dec
## 1949 -42.6224747 -42.0732323 -8.4785354 11.0593434 28.5934343 16.8699495
## 1950 -34.7474747 -35.9898990 -4.2285354 5.2260101 16.0517677 13.9116162
## 1951 -36.0808081 -37.4065657 -7.9785354 5.8093434 21.5517677 14.4532828
## 1952 -31.8724747 -20.5732323 -9.7285354 5.3926768 15.1767677 9.2449495
## 1953 -25.1641414 -16.1565657 -4.4785354 7.0593434 9.1351010 4.0782828
## 1954 -2.3308081 -13.7815657 -4.6868687 -0.6073232 3.0934343 0.4949495
## 1955 14.4191919 -5.1565657 2.2297980 -2.5239899 -10.4065657 1.1616162
## 1956 19.6275253 10.3434343 4.0214646 -10.8989899 -15.9482323 -9.4633838
## 1957 31.7108586 32.9684343 15.3131313 -4.7739899 -14.1565657 -9.0050505
## 1958 45.3358586 58.5101010 0.9797980 -10.6906566 -31.1148990 -33.0050505
## 1959 53.4608586 61.0517677 8.7714646 -13.3156566 -30.2398990 -17.0050505
## 1960 NA NA NA NA NA NA
##
## $figure
## [1] -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## [7] 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"

arima_air <- fit_arima(air_data, "AirPassengers")

## Fitting ARIMA model for AirPassengers

## ADF Test p-value: 0.01


## Series: ts_data
## ARIMA(4,1,2) with drift
##
## Coefficients:

## ar1 ar2 ar3 ar4 ma1 ma2 drift
## 0.2243 0.3689 -0.2567 -0.2391 -0.0971 -0.8519 2.6809
## s.e. 0.1047 0.1147 0.0985 0.0919 0.0866 0.0877 0.1711
##
## sigma^2 = 706.3: log likelihood = -670.07
## AIC=1356.15 AICc=1357.22 BIC=1379.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.228696 25.82793 20.59211 -1.665245 7.476447 0.6428946
## ACF1
## Training set 0.0009861078

[Figure: AirPassengers ARIMA Forecast — observed series with the 12-month forecast and prediction intervals.]

sarima_air <- fit_sarima(air_data, "AirPassengers")

## Fitting SARIMA model for AirPassengers


## Series: ts_data
## ARIMA(2,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1
## 0.5960 0.2143 -0.9819
## s.e. 0.0888 0.0880 0.0292
##
## sigma^2 = 132.3: log likelihood = -504.92
## AIC=1017.85 AICc=1018.17 BIC=1029.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.3423 10.84619 7.86754 0.420698 2.800458 0.245628 -0.00124847

[Figure: AirPassengers SARIMA Forecast — observed series with the 12-month forecast and prediction intervals.]

compare_models(arima_air, sarima_air, air_data)

## Comparing ARIMA and SARIMA models:


## ARIMA Forecast Accuracy:
##  -23.06901 83.58979 74.65149 -7.376947 16.02863
## SARIMA Forecast Accuracy:
##  -31.66945 31.70666 31.66945 -6.784721 6.784721

Forecasting and Plot Comparison for AirPassengers Dataset

The forecasting comparison evaluates both ARIMA and SARIMA models by comparing their forecasts to
actual values.

h_air <- 12
air_actual_values <- air_data[(length(air_data) - h_air + 1):length(air_data)]
arima_air_forecast <- forecast(arima_air, h = h_air)$mean
sarima_air_forecast <- forecast(sarima_air, h = h_air)$mean
time_points_air <- time(air_data)[(length(air_data) - h_air + 1):length(air_data)]
plot_forecast_comparison1(air_actual_values, arima_air_forecast, sarima_air_forecast,
time_points_air)

[Figure: Forecast Comparison for AirPassengers — actual values (blue) against the ARIMA (RMSE = 83.59) and SARIMA (RMSE = 31.71) forecasts over 1960.]

4.2 Monthly Milk Production Dataset

The Monthly Milk Production dataset contains monthly records of milk production per cow from 1994
to 2005. Below is the analysis conducted on this dataset:

# Monthly Milk Production Dataset Analysis


data(milk)
milk_data <- milk
perform_eda(milk_data, "Monthly Milk Production")

## Exploratory Data Analysis for Monthly Milk Production


## milk
## Min. :1236
## 1st Qu.:1420
## Median :1504
## Mean :1504
## 3rd Qu.:1588
## Max. :1760

[Figure: Monthly Milk Production time series (Values vs Time, 1994–2006).]

## ACF and PACF plots:

[Figures: ACF and PACF of Monthly Milk Production (autocorrelation and partial autocorrelation vs lag).]

decompose_ts(milk_data, "Monthly Milk Production")

## Decomposing the time series for Monthly Milk Production

[Figure: Decomposition of additive time series — observed, trend, seasonal, and random components of Monthly Milk Production.]

## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1994 1343 1236 1401 1396 1457 1388 1389 1369 1318 1354 1312 1370
## 1995 1404 1295 1453 1427 1484 1421 1414 1375 1331 1364 1320 1380
## 1996 1415 1348 1469 1441 1479 1398 1400 1382 1342 1391 1350 1418
## 1997 1433 1328 1500 1474 1529 1471 1473 1446 1377 1416 1369 1438
## 1998 1466 1347 1515 1501 1556 1477 1468 1443 1386 1446 1407 1489
## 1999 1518 1404 1585 1554 1610 1516 1498 1487 1445 1491 1459 1538
## 2000 1579 1506 1632 1593 1636 1547 1561 1525 1464 1511 1459 1519
## 2001 1549 1431 1599 1571 1632 1555 1552 1520 1472 1522 1485 1549
## 2002 1591 1472 1654 1621 1678 1587 1578 1570 1497 1539 1496 1575
## 2003 1615 1489 1666 1627 1671 1596 1597 1571 1511 1561 1517 1596
## 2004 1624 1531 1661 1636 1692 1607 1623 1601 1533 1583 1531 1610
## 2005 1643 1522 1707 1690 1760 1690 1683 1671 1599 1637 1592 1663
##
## $seasonal
## Jan Feb Mar Apr May Jun Jul
## 1994 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1995 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1996 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1997 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1998 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1999 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2000 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2001 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2002 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2003 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2004 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2005 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## Aug Sep Oct Nov Dec
## 1994 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1995 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1996 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1997 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1998 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1999 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2000 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2001 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2002 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2003 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2004 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2005 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
##
## $trend
## Jan Feb Mar Apr May Jun Jul Aug
## 1994 NA NA NA NA NA NA 1363.625 1368.625
## 1995 1384.042 1385.333 1386.125 1387.083 1387.833 1388.583 1389.458 1392.125
## 1996 1393.917 1393.625 1394.375 1395.958 1398.333 1401.167 1403.500 1403.417
## 1997 1421.208 1426.917 1431.042 1433.542 1435.375 1437.000 1439.208 1441.375
## 1998 1448.208 1447.875 1448.125 1449.750 1452.583 1456.292 1460.583 1465.125
## 1999 1486.750 1489.833 1494.125 1498.458 1502.500 1506.708 1511.292 1518.083
## 2000 1536.875 1541.083 1543.458 1545.083 1545.917 1545.125 1543.083 1538.708
## 2001 1530.958 1530.375 1530.500 1531.292 1532.833 1535.167 1538.167 1541.625
## 2002 1559.667 1562.833 1565.958 1567.708 1568.875 1570.417 1572.500 1574.208
## 2003 1577.375 1578.208 1578.833 1580.333 1582.125 1583.875 1585.125 1587.250
## 2004 1593.083 1595.417 1597.583 1599.417 1600.917 1602.083 1603.458 1603.875
## 2005 1626.917 1632.333 1638.000 1643.000 1647.792 1652.542 NA NA
## Sep Oct Nov Dec
## 1994 1373.250 1376.708 1379.125 1381.625

## 1995 1395.000 1396.250 1396.625 1395.458
## 1996 1403.875 1406.542 1410.000 1415.125
## 1997 1442.792 1444.542 1446.792 1448.167
## 1998 1470.417 1475.542 1480.000 1483.875
## 1999 1524.292 1527.875 1530.583 1532.958
## 2000 1534.208 1531.917 1530.833 1531.000
## 2001 1545.625 1550.000 1554.000 1557.250
## 2002 1575.417 1576.167 1576.125 1576.208
## 2003 1588.792 1588.958 1590.208 1591.542
## 2004 1605.417 1609.583 1614.667 1620.958
## 2005 NA NA NA NA
##
## $random
## Jan Feb Mar Apr May
## 1994 NA NA NA NA NA
## 1995 -5.21085859 -7.42676768 -8.73737374 -5.74116162 -1.17676768
## 1996 -4.08585859 37.28156566 -0.98737374 -0.61616162 -16.67676768
## 1997 -13.37752525 -16.01010101 -6.65404040 -5.19949495 -3.71843434
## 1998 -7.37752525 -17.96843434 -8.73737374 5.59217172 6.07323232
## 1999 6.08080808 -2.92676768 15.26262626 9.88383838 10.15656566
## 2000 16.95580808 47.82323232 12.92929293 2.25883838 -7.26010101
## 2001 -7.12752525 -16.46843434 -7.11237374 -5.94949495 1.82323232
## 2002 6.16414141 -7.92676768 12.42929293 7.63383838 11.78156566
## 2003 12.45580808 -6.30176768 11.55429293 1.00883838 -8.46843434
## 2004 5.74747475 18.48989899 -12.19570707 -9.07449495 -6.26010101
## 2005 -9.08585859 -27.42676768 -6.61237374 1.34217172 14.86489899
## Jun Jul Aug Sep Oct
## 1994 NA 12.47853535 13.69823232 16.04292929 5.22095960
## 1995 15.60732323 11.64520202 -3.80176768 7.29292929 -4.32070707
## 1996 -19.97601010 -16.39646465 -8.09343434 9.41792929 12.38762626
## 1997 17.19065657 20.89520202 17.94823232 5.50126263 -0.61237374
## 1998 3.89898990 -5.47979798 -8.80176768 -13.12373737 -1.61237374
## 1999 -7.51767677 -26.18813131 -17.76010101 -7.99873737 -8.94570707
## 2000 -14.93434343 5.02020202 -0.38510101 1.08459596 7.01262626
## 2001 3.02398990 0.93686869 -8.30176768 -2.33207071 -0.07070707
## 2002 -0.22601010 -7.39646465 9.11489899 -7.12373737 -9.23737374
## 2003 -4.68434343 -1.02146465 -2.92676768 -6.49873737 -0.02904040
## 2004 -11.89267677 6.64520202 10.44823232 -1.12373737 1.34595960
## 2005 20.64898990 NA NA NA NA
## Nov Dec
## 1994 6.06565657 -6.77904040
## 1995 -3.43434343 -10.61237374
## 1996 13.19065657 7.72095960
## 1997 -4.60101010 -5.32070707
## 1998 0.19065657 9.97095960
## 1999 1.60732323 9.88762626
## 2000 1.35732323 -7.15404040
## 2001 4.19065657 -3.40404040
## 2002 -6.93434343 3.63762626
## 2003 -0.01767677 9.30429293
## 2004 -10.47601010 -6.11237374
## 2005 NA NA
##
## $figure
## [1] 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## [8] -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
##
## $type
## [1] "additive"

##
## attr(,"class")
## [1] "decomposed.ts"

arima_milk <- fit_arima(milk_data, "Monthly Milk Production")

## Fitting ARIMA model for Monthly Milk Production

## ADF Test p-value: 0.01


## Series: ts_data
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.2066 0.3330 -0.9109
## s.e. 0.0879 0.0869 0.0336
##
## sigma^2 = 3373: log likelihood = -782.65
## AIC=1573.29 AICc=1573.58 BIC=1585.14
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 10.57461 57.26184 47.71957 0.5838515 3.166587 1.536337 0.002171886

[Figure: Monthly Milk Production ARIMA Forecast — observed series with the 12-month forecast and prediction intervals.]

sarima_milk <- fit_sarima(milk_data, "Monthly Milk Production")

## Fitting SARIMA model for Monthly Milk Production


## Series: ts_data
## ARIMA(1,0,0)(2,1,2)[12] with drift

##
## Coefficients:
## ar1 sar1 sar2 sma1 sma2 drift
## 0.8638 0.0607 -0.4074 -1.0121 0.4831 2.1882
## s.e. 0.0475 0.1862 0.1173 0.1994 0.1881 0.2174
##
## sigma^2 = 137.9: log likelihood = -518.84
## AIC=1051.67 AICc=1052.57 BIC=1071.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1211196 10.98512 8.342375 -0.01115387 0.5520753 0.2685838
## ACF1
## Training set -0.08850265

[Figure: Monthly Milk Production SARIMA Forecast — observed series with the 12-month forecast and prediction intervals.]

compare_models(arima_milk, sarima_milk, milk_data)

## Comparing ARIMA and SARIMA models:


## ARIMA Forecast Accuracy:
##  21.88452 64.7399 54.04997 1.188661 3.264899
## SARIMA Forecast Accuracy:
##  -17.81697 28.77242 19.52245 -1.093511 1.195402

Forecasting and Plot Comparison for Milk Production Dataset

The forecasting comparison evaluates both ARIMA and SARIMA models by comparing their forecasts to
actual values.

h_milk <- 12
milk_actual_values <- milk_data[(length(milk_data) - h_milk + 1):length(milk_data)]

arima_milk_forecast <- forecast(arima_milk, h = h_milk)$mean
sarima_milk_forecast <- forecast(sarima_milk, h = h_milk)$mean
time_points_milk <- time(milk_data)[(length(milk_data) - h_milk + 1):length(milk_data)]
plot_forecast_comparison(milk_actual_values, arima_milk_forecast, sarima_milk_forecast,
time_points_milk)

[Figure: Forecast Comparison for Monthly Milk Production — actual values (blue) against the ARIMA (RMSE = 64.74) and SARIMA (RMSE = 28.77) forecasts over 2005.]

5. Conclusion

This document presented a comprehensive analysis of the AirPassengers and Monthly Milk Produc-
tion datasets using ARIMA and SARIMA models. The analysis included exploratory data visualization,
decomposition, model fitting, and forecasting. Key insights are summarized as follows:

• ARIMA Models: Effective for capturing non-seasonal trends and patterns. However, they might
underperform for datasets with strong seasonal components.
• SARIMA Models: Demonstrated superior performance in handling seasonal variations, especially
evident in datasets like AirPassengers and Monthly Milk Production.
• Model Comparison: The comparison of forecasting accuracy metrics such as RMSE highlighted the
relative strengths of ARIMA and SARIMA models for each dataset.

The results underscore the importance of selecting models that align with the inherent characteristics of the
time series data. While ARIMA offers simplicity and robustness for non-seasonal data, SARIMA excels in
datasets where seasonality plays a significant role. Future work could explore hybrid approaches or advanced
techniques like machine learning to enhance forecasting accuracy.

Program - 10

Interactive Visualization with plotly and

Dynamic Reports with RMarkdown

DATE - 20/12/2024

1. Introduction
This report explores the ‘gapminder’ dataset and creates interactive visualizations using the ‘plotly’ library.
The dataset provides valuable insights into global trends in life expectancy, GDP per capita, and population
across countries and continents over time. By leveraging interactive plots, the analysis becomes more
engaging and accessible for users. Finally, these visualizations are integrated into an interactive dashboard
for a comprehensive exploration experience.

2. Load Required Libraries


The analysis makes use of the ‘plotly’ library for creating interactive visualizations, the ‘gapminder’ library
for accessing the dataset, and the ‘dplyr’ library for data manipulation. These libraries provide the necessary
tools to filter, transform, and visualize the data effectively.

# Load necessary libraries


library(plotly)
library(gapminder)
library(dplyr)
library(ggplot2)

3. Load the dataset

data("gapminder")

4. ‘Gapminder’ Dataset Overview


The ‘gapminder’ dataset contains data on life expectancy, GDP per capita, and population for countries
across the world, spanning multiple years. It provides a unique opportunity to analyze historical trends and
compare economic and health indicators across different regions. Below are the first few rows of the dataset:

# Display a preview of the dataset


head(gapminder)

## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.

5. Scatter Plot: GDP vs Life Expectancy by Continent

This visualization shows the relationship between GDP per capita and life expectancy, with the data grouped
by continent. It helps uncover how wealth (GDP per capita) correlates with health outcomes (life expectancy)
across regions. The size of the points represents the population of each country. When the plot is rendered
interactively with plotly (see the sketch after the figure), hovering over the points reveals additional details, such as the country name and its GDP per capita.

# Scatter plot of GDP vs Life Expectancy by Continent


scatter_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent,
size = pop)) + geom_point(alpha = 0.7) +
scale_x_log10() + # Log scale for better visualization
labs(title = "GDP vs Life Expectancy by Continent",
x = "GDP per Capita (Log Scale)",
y = "Life Expectancy") +
theme_minimal()

scatter_plot

[Figure: GDP vs Life Expectancy by Continent — GDP per Capita (log scale) vs Life Expectancy, colored by continent, point size proportional to population.]
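
Because scatter_plot is a static ggplot object, hover tooltips only appear once it is passed through plotly. A minimal sketch of that conversion (adding a text aesthetic is an assumption made here so the country name shows up in the tooltip; ggplot2 may warn that geom_point ignores the unknown aesthetic, which is harmless):

# Convert the ggplot object to an interactive plotly widget with country names on hover
interactive_scatter <- ggplotly(scatter_plot + aes(text = country),
                                tooltip = c("text", "x", "y", "colour"))
interactive_scatter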

6. Bar Chart: Life Expectancy by Country in 2007

This bar chart highlights the life expectancy of each country for the year 2007. It allows for quick identification
of countries with the highest and lowest life expectancies in the dataset. Each bar is labeled with the country
name and sorted by life expectancy, making it easier to compare values across countries.

# Filter for year 2007 and create a bar chart of life expectancy by country
bar_chart <- gapminder %>%
filter(year == 2007) %>%
arrange(desc(lifeExp)) %>%
ggplot(aes(x = reorder(country, lifeExp), y = lifeExp, fill = continent)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip for better readability
labs(title = "Life Expectancy by Country in 2007",
x = "Country",
y = "Life Expectancy") +
theme_minimal()

bar_chart

[Figure: Life Expectancy by Country in 2007 — horizontal bars for every country, sorted by life expectancy and filled by continent; values range from roughly 40 to over 80 years.]

7. Line Chart: Life Expectancy Trends in Asia

This line chart visualizes the trends in life expectancy over time for countries in Asia. It provides a detailed
view of how life expectancy has evolved in different Asian countries, highlighting variations and patterns.
Each line represents a country, enabling a comparative view of how life expectancy has changed from year
to year.

# Filter data for Asia and create a line chart showing life expectancy trends over time
line_chart <- gapminder %>%
filter(continent == "Asia") %>%
ggplot(aes(x = year, y = lifeExp, color = country, group = country)) +
geom_line() +
labs(title = "Life Expectancy Trend in Asia",
x = "Year",
y = "Life Expectancy") +
theme_minimal()

line_chart

[Figure: Life Expectancy Trend in Asia — one line per Asian country, Year vs Life Expectancy from 1952 to 2007.]

8. Interactive Dashboard: Combined Visualizations

To provide a holistic view, the scatter plot, bar chart, and line chart are integrated into an interactive
dashboard. This layout allows users to explore multiple aspects of the data simultaneously. It combines the
key insights from individual plots into a single interface, making it easier to derive meaningful conclusions.

# Combine the scatter, bar, and line charts into one interactive layout
dashboard <- subplot(scatter_plot, bar_chart, line_chart, nrows = 1) %>%
layout(title = 'Gapminder Data Visualization')

# Display the dashboard


dashboard

[Figure: Gapminder Data Visualization dashboard — the scatter plot, bar chart, and line chart combined side by side with subplot().]

9. Conclusion

The visualizations above provide insights into global trends in life expectancy, GDP per capita, and pop-
ulation. By combining interactive elements, users are empowered to explore patterns, relationships, and
disparities in the data. Using the ‘plotly’ library, dynamic and visually engaging visualizations have been
created, enhancing data exploration and analysis.
