Programming With R
B.M.S. COLLEGE OF ENGINEERING
(Autonomous College under VTU)
Bull Temple Road, Basavangudi, Bangalore - 560019
Lab Observation
on
Programming with R
Submitted by
BACHELOR OF ENGINEERING
in
Computer Science & Engineering (Data Science)
Dr. Kalyan N
Assistant Professor
Department of CSE (Data Science),
B.M.S. College of Engineering
2024-2025
TABLE OF CONTENTS

Sl No  Program                                                                    Date of Execution  Page No  Marks  Faculty In-Charge Sign
1      Arithmetic Operations, Variable Assignment, and Conditional Statements     18/10/2024         2
2      Creating and Manipulating Data Structures                                  25/10/2024         5
3      Basic Statistical Operations on Open-Source Datasets                       25/10/2024         13
4      Data Import, Cleaning, and Export with Advanced Data Wrangling             08/11/2024         22
5      Advanced Data Manipulation with dplyr and Complex Grouping                 08/11/2024         28
6      Data Visualization with ggplot2 and Customizations                         15/11/2024         36
7      Linear and Multiple Regression Analysis with Interaction Terms             22/11/2024         42
8      K-Means Clustering and PCA for Dimensionality Reduction                    22/11/2024         50
9      Time Series Analysis using ARIMA and Seasonal Decomposition                20/12/2024         58
10     Interactive Visualization with plotly and Dynamic Reports with RMarkdown   20/12/2024         76
Program - 1
Arithmetic Operations, Variable Assignment, and Conditional Statements
DATE - 18/10/2024
1. Introduction
This R program determines whether three given side lengths form a valid triangle. If valid, the program
identifies the type of triangle (Equilateral, Isosceles, or Scalene) and calculates its area using Heron’s formula.
This process involves input validation, application of geometric principles, and computation techniques, all
explained step by step.
2. Code Explanation
The is_valid_triangle function verifies if the given side lengths satisfy the triangle inequality theorem, a
fundamental condition for forming a triangle.
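The function body itself is not preserved in this extract; a minimal sketch consistent with the description (function name as used in the text) is:
# Triangle inequality: every pair of sides must sum to more than the third side
is_valid_triangle <- function(a, b, c) {
  (a + b > c) && (a + c > b) && (b + c > a)
}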
Triangle Type
The triangle_type function identifies the type of triangle based on its side lengths.
• Logic:
– Equilateral: All sides are equal.
– Isosceles: Two sides are equal.
– Scalene: All sides are different.
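A minimal sketch of triangle_type consistent with this logic:
# Classify the triangle by counting equal side pairs
triangle_type <- function(a, b, c) {
  if (a == b && b == c) {
    "Equilateral"
  } else if (a == b || b == c || a == c) {
    "Isosceles"
  } else {
    "Scalene"
  }
}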
Area Calculation
The triangle_area function calculates the area of a triangle using Heron’s formula.
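The code is not shown in the extract; Heron's formula gives a direct sketch:
# Heron's formula: s is the semi-perimeter of the triangle
triangle_area <- function(a, b, c) {
  s <- (a + b + c) / 2
  sqrt(s * (s - a) * (s - b) * (s - c))
}
For example, triangle_area(3, 4, 5) returns 6.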
Input Validation
The validate_input function ensures that each input is a positive numeric value.
• Logic:
– Non-numeric or non-positive values trigger an error.
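A sketch of validate_input matching this behaviour:
# Raise an error for anything that is not a single positive number
validate_input <- function(x) {
  if (!is.numeric(x) || is.na(x) || x <= 0) {
    stop("Each side must be a positive numeric value.")
  }
  invisible(TRUE)
}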
This block integrates the functions to perform the required operations interactively.
cat("Side a:", a, "\n")
## Side a: 3
cat("Side b:", b, "\n")
## Side b: 4
cat("Side c:", c, "\n")
## Side c: 5
• Input Handling: The program prompts the user to input the three side lengths, converting them to
numeric values for further processing.
• Error Handling: The tryCatch block manages errors efficiently, ensuring that the program does not
crash due to invalid input or failed validations. If an error is encountered, a meaningful message is
displayed.
• Sequential Processing:
1. Each side length is validated using the validate_input function.
2. The validity of the triangle is checked with the is_valid_triangle function.
3. If valid, the type of the triangle is determined and displayed.
4. Finally, the area of the triangle is computed and printed.
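A sketch of the driver block implied by these steps (prompt wording and variable names are assumptions):
result <- tryCatch({
  a <- as.numeric(readline("Side a: "))
  b <- as.numeric(readline("Side b: "))
  c <- as.numeric(readline("Side c: "))
  validate_input(a); validate_input(b); validate_input(c)
  if (!is_valid_triangle(a, b, c)) stop("The given sides do not form a triangle.")
  cat("Triangle type:", triangle_type(a, b, c), "\n")
  cat("Area:", triangle_area(a, b, c), "\n")
}, error = function(e) {
  cat("Error:", conditionMessage(e), "\n")  # display a meaningful message instead of crashing
})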
3. Conclusion
This R program effectively validates triangle properties, identifies its type, and calculates its area using
well-structured functions. Each step follows mathematical and computational principles, ensuring accurate
results. This program is a practical tool for triangle analysis and can be extended for educational or practical
purposes.
Program - 2
Creating and Manipulating Data Structures
DATE - 25/10/2024
1. Introduction
This document demonstrates various operations in R, including vector manipulation, matrix operations, list
handling, and data frame analysis. The primary aim was to perform several data manipulations using R’s
built-in functions and packages.
2. Vector Operations
The first step is to generate a random vector using the runif() function, which creates uniformly distributed
random numbers. This vector consists of 20 random values between 1 and 100.
set.seed(42)
random_vector <- runif(20, min = 1, max = 100)
cat("Random vector created:\n", head(random_vector,8))
Next, the vector is sorted in ascending order using the sort() function. Sorting helps to organize the data
and identify trends or outliers.
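The sorting call is omitted from the extract; it presumably resembles:
sorted_vector <- sort(random_vector)
cat("Sorted vector:\n", head(sorted_vector, 6))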
## Sorted vector:
## 12.63125 14.33199 26.28745 29.32781 46.31644 46.76699
A specific value (50) is searched for within the random vector. The any() function is used to check if the
value exists in the vector.
search_value <- 50
is_value_present <- any(random_vector == search_value)
cat("Is", search_value, "present in the vector? ", is_value_present)
Filtering Values Greater Than 60
Values greater than 60 are filtered from the vector using logical indexing. Filtering is helpful to subset data
based on certain conditions.
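The filtering code is not preserved; a sketch using logical indexing:
filtered_values <- random_vector[random_vector > 60]
cat("Values greater than 60:\n", filtered_values)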
3. Matrix Operations
Creating a Matrix
The random vector is reshaped into a 4x5 matrix using the matrix() function. This matrix format allows
for more complex data operations.
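The reshaping call is missing from the extract; the list output later in this program confirms a column-wise fill, so it was presumably:
matrix_from_vector <- matrix(random_vector, nrow = 4, ncol = 5)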
print(matrix_from_vector)
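The multiplication step is also missing; since a 4x5 matrix cannot be multiplied by itself with %*%, a transpose is presumably involved (an assumption):
# %*% needs conformable dimensions: 4x5 %*% 5x4 gives a 4x4 result
matrix_multiplication_result <- matrix_from_vector %*% t(matrix_from_vector)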
print(matrix_multiplication_result)
elementwise_multiplication_result <- matrix_from_vector * matrix_from_vector
cat("Element-wise multiplication result:\n")
print(elementwise_multiplication_result)
4. List Operations
Creating a List
A list is created combining various data types, including numeric vectors, characters, logical values, and
matrices. Lists in R can hold elements of different types, unlike vectors and matrices.
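The construction itself is not preserved; from the printed components it was presumably:
my_list <- list(
  numbers = random_vector,
  characters = c("A", "B", "C", "D"),
  logical_values = c(TRUE, FALSE, TRUE),
  matrix = matrix_from_vector
)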
cat("List created:\n")
print(my_list)
## List created:
## $numbers
## [1] 91.56580 93.77047 29.32781 83.21431 64.53281 52.39050 73.92224 14.33199
## [9] 66.04224 70.80141 46.31644 72.19211 93.53255 26.28745 46.76699 94.06144
## [17] 97.84442 12.63125 48.02471 56.47294
##
## $characters
## [1] "A" "B" "C" "D"
##
## $logical_values
## [1] TRUE FALSE TRUE
##
## $matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 91.56580 64.53281 66.04224 93.53255 97.84442
## [2,] 93.77047 52.39050 70.80141 26.28745 12.63125
## [3,] 29.32781 73.92224 46.31644 46.76699 48.02471
## [4,] 83.21431 14.33199 72.19211 94.06144 56.47294
Specific elements are extracted from the list using $ notation. This allows accessing individual components
of the list.
subset_numeric <- my_list$numbers
cat("Numeric subset of the list:\n", head(subset_numeric),"\n")
The second character in the character vector of the list is modified to “Z”. This shows how to update elements
within a list.
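The update step presumably looks like:
my_list$characters[2] <- "Z"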
The numbers in the list are squared using the ^ operator to demonstrate vectorized operations in R, where each element is squared individually.
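The squaring code is missing from the extract; matching the output below, it was presumably:
squared_numbers <- my_list$numbers^2
cat("Squared numbers:\n", head(squared_numbers))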
## Squared numbers:
## 8384.295 8792.9 860.1207 6924.622 4164.483 2744.764
5. Data Frame Operations
A data frame is created to represent structured tabular data. This data frame contains columns for ID, age, score, and pass/fail status.
df <- data.frame(
ID = 1:20,
Age = sample(18:65, 20, replace = TRUE),
Score = runif(20, min = 50, max = 100),
Passed = sample(c(TRUE, FALSE), 20, replace = TRUE)
)
cat("Data frame created:\n")
print(df)
The data frame is filtered to extract rows where both Age > 30 and Score > 70. This filtering is essential
for focusing on specific subgroups of data.
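The filtering code is not shown; a sketch (dplyr::filter would be equally plausible):
filtered_df <- df[df$Age > 30 & df$Score > 70, ]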
print(filtered_df)
Summary Statistics
Summary statistics (mean, sum, and variance) are calculated for both the Age and Score columns. These
statistics are useful for understanding the central tendency and variability of the data.
mean_age <- mean(df$Age)
sum_age <- sum(df$Age)
var_age <- var(df$Age)
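The corresponding Score statistics described above are missing from the extract; they presumably mirror the Age code:
mean_score <- mean(df$Score, na.rm = TRUE)
sum_score <- sum(df$Score, na.rm = TRUE)
var_score <- var(df$Score, na.rm = TRUE)
cat("Age - mean:", mean_age, "sum:", sum_age, "variance:", var_age, "\n")
cat("Score - mean:", mean_score, "sum:", sum_score, "variance:", var_score, "\n")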
print(df)
Handling Missing Values
df$Score[is.na(df$Score)] <- mean(df$Score, na.rm = TRUE)
cat("Data frame after handling missing values:\n")
print(df)
Grouped Statistics
Using the dplyr package, statistics for Age and Score are calculated grouped by the Passed status. This
approach allows for understanding the differences in data across categories.
library(dplyr)
grouped_stats <- df %>%
group_by(Passed) %>%
summarise(
mean_score = mean(Score, na.rm = TRUE),
mean_age = mean(Age)
)
cat("Grouped statistics by Passed status:\n")
print(grouped_stats)
## # A tibble: 2 x 3
## Passed mean_score mean_age
## <lgl> <dbl> <dbl>
## 1 FALSE 80.6 44.9
## 2 TRUE 72.6 50
6. Conclusion
This analysis covers essential operations on vectors, matrices, lists, and data frames in R. The operations performed include sorting, searching, filtering, subsetting, and summarizing data. These methods are foundational for data analysis and statistical modeling in R.
Program - 3
Basic Statistical Operations on Open-Source Datasets
DATE - 25/10/2024
1. Introduction
This analysis performs statistical analysis on two datasets: the Iris dataset and the Palmer Penguins dataset.
Various statistical metrics like mean, median, mode, variance, standard deviation, skewness, and kurtosis are
calculated. Hypothesis testing and data visualization are also performed to explore these datasets further.
The following libraries are loaded:
‘dplyr’ for data manipulation, ‘ggplot2’ for creating visualizations, ‘moments’ for calculating skewness and
kurtosis, ‘palmerpenguins’ for the Palmer Penguins dataset.
library(dplyr)
library(ggplot2)
library(moments)
library(palmerpenguins)
2. Load Datasets
The ‘iris’ and ‘penguins’ datasets are loaded. The ‘iris’ dataset comes from R’s built-in datasets and contains
measurements of sepal and petal lengths and widths for different iris species. The ‘penguins’ dataset provides
measurements of flipper length, body mass, and other characteristics for three penguin species.
data(iris)
data(penguins)
The mode is defined as the most frequently occurring value in a dataset. A custom function ‘calc_mode()’
is defined to calculate the mode by sorting the frequency of the elements and returning the most frequent
one.
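Its definition is not preserved; a common implementation matching the description:
# Return the most frequently occurring value in x
calc_mode <- function(x) {
  freq <- sort(table(x), decreasing = TRUE)  # frequency of each element, most frequent first
  as.numeric(names(freq)[1])
}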
Mean
The mean is calculated for each of the numeric columns in the ‘iris’ dataset using the ‘sapply()’ function.
The ‘na.rm = TRUE’ argument ensures that missing values are ignored during the calculation.
iris_mean <- sapply(iris[, 1:4], mean, na.rm = TRUE)
print(paste("Mean of Iris dataset: ", iris_mean))
This code calculates the mean of the sepal length, sepal width, petal length, and petal width for all species in the ‘iris’ dataset.
Median
The median is calculated similarly to the mean. The median is the middle value when the data is sorted. If
the dataset has an even number of values, the average of the two middle values is taken.
This code calculates the median for the four numeric columns in the ‘iris’ dataset.
Mode
The mode is calculated for each numeric column using the ‘calc_mode()’ function defined earlier.
This code calculates the mode for sepal length, sepal width, petal length, and petal width.
Variance
Variance measures the spread of data points. The var() function is used to compute variance for each numeric column in the dataset.
This code computes the variance for the four numeric columns in the ‘iris’ dataset.
Standard Deviation
The standard deviation is the square root of the variance and provides a measure of the spread of data
points. It is calculated using the ‘sd()’ function.
This code computes the standard deviation for each numeric column in the ‘iris’ dataset.
Skewness
Skewness measures the asymmetry of the distribution of data. The ‘skewness()’ function from the ‘moments’
package is used to compute skewness.
This code calculates the skewness for the numeric columns in the ‘iris’ dataset.
Kurtosis
Kurtosis measures the “tailedness” of the distribution. The ‘kurtosis()’ function is used to compute kurtosis
for each numeric column in the dataset.
This code calculates the kurtosis for the four numeric columns in the ‘iris’ dataset.
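The individual chunks for the statistics above are not preserved in the extract; they presumably follow the same sapply() pattern as the mean:
iris_median <- sapply(iris[, 1:4], median, na.rm = TRUE)
iris_mode <- sapply(iris[, 1:4], calc_mode)
iris_var <- sapply(iris[, 1:4], var, na.rm = TRUE)
iris_sd <- sapply(iris[, 1:4], sd, na.rm = TRUE)
iris_skewness <- sapply(iris[, 1:4], skewness, na.rm = TRUE)  # from the moments package
iris_kurtosis <- sapply(iris[, 1:4], kurtosis, na.rm = TRUE)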
5. Hypothesis Testing
A t-test is performed to compare the Sepal Length between the Setosa and Versicolor species. The ‘t.test()’
function is used to test if the means of two independent groups are different.
setosa <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t_test <- t.test(setosa, versicolor)
print(t_test)
##
## Welch Two Sample t-test
##
## data: setosa and versicolor
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y
## 5.006 5.936
This code compares the Sepal Length of Setosa and Versicolor species using a two-sample t-test.
[Figure: histogram of Sepal.Length in the ‘iris’ dataset (x: Sepal.Length, y: count)]
This code creates a histogram to show the distribution of Sepal Length in the ‘iris’ dataset.
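The plotting code itself is not preserved; a sketch (bin settings are an assumption):
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 30) +
  ggtitle("Histogram of Sepal Length in Iris Dataset")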
A boxplot is created to show the variation in Sepal Length for each species. The ‘geom_boxplot()’ function
is used to plot the boxplot.
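A sketch mirroring the penguins boxplot code that survives later in this program:
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sepal Length by Species in Iris Dataset")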
[Figure: boxplot of Sepal.Length by Species (setosa, versicolor, virginica)]
This code creates a boxplot to visualize the Sepal Length distribution across different species in the ‘iris’
dataset.
Data Cleaning
The ‘na.omit()’ function is used to remove rows with missing values in the ‘penguins’ dataset.
This code removes rows containing missing values in the ‘penguins’ dataset.
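The call itself (the penguins_clean object is used below) was presumably:
penguins_clean <- na.omit(penguins)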
Mean
The mean of the numeric columns (flipper length, body mass, etc.) in the ‘penguins’ dataset is calculated.
penguins_mean <- sapply(penguins_clean[, 3:6], mean, na.rm = TRUE)
print(paste("Mean of Palmer Penguins dataset: ", penguins_mean))
This code calculates the mean for each numeric column in the ‘penguins’ dataset.
Median
The median is calculated for the numeric columns in the cleaned ‘penguins’ dataset.
This code computes the median for the numeric columns in the ‘penguins’ dataset.
Mode
The mode is calculated for the numeric columns in the ‘penguins’ dataset using the ‘calc_mode()’ function.
This code calculates the mode for the numeric columns in the ‘penguins’ dataset.
Variance
The variance for the numeric columns in the ‘penguins’ dataset is calculated.
This code computes the variance for each numeric column in the ‘penguins’ dataset.
Standard Deviation
The standard deviation is calculated for each numeric column in the ‘penguins’ dataset.
This code computes the standard deviation for each numeric column in the ‘penguins’ dataset.
Skewness
The skewness for each numeric column in the ‘penguins’ dataset is calculated.
This code calculates the skewness for the numeric columns in the ‘penguins’ dataset.
Kurtosis
The kurtosis for each numeric column in the ‘penguins’ dataset is calculated.
This code calculates the kurtosis for the numeric columns in the penguins dataset.
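As with the iris dataset, the individual chunks are missing; they presumably follow the same sapply() pattern:
penguins_median <- sapply(penguins_clean[, 3:6], median, na.rm = TRUE)
penguins_mode <- sapply(penguins_clean[, 3:6], calc_mode)
penguins_var <- sapply(penguins_clean[, 3:6], var, na.rm = TRUE)
penguins_sd <- sapply(penguins_clean[, 3:6], sd, na.rm = TRUE)
penguins_skewness <- sapply(penguins_clean[, 3:6], skewness, na.rm = TRUE)
penguins_kurtosis <- sapply(penguins_clean[, 3:6], kurtosis, na.rm = TRUE)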
8. Hypothesis Testing
A t-test is performed to compare the flipper length between the ‘Adelie’ and ‘Gentoo’ species in the ‘penguins’
dataset.
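The test code is missing, but the output below names the vectors adelie and gentoo, so it presumably mirrors the iris t-test:
adelie <- subset(penguins_clean, species == "Adelie")$flipper_length_mm
gentoo <- subset(penguins_clean, species == "Gentoo")$flipper_length_mm
t_test_penguins <- t.test(adelie, gentoo)
print(t_test_penguins)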
##
## Welch Two Sample t-test
##
## data: adelie and gentoo
## t = -33.506, df = 251.35, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.72740 -25.53771
## sample estimates:
## mean of x mean of y
## 190.1027 217.2353
This code compares the flipper length of Adelie and Gentoo species using a two-sample t-test.
A histogram is created to visualize the distribution of flipper length in the ‘penguins’ dataset.
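The plotting code is not preserved; a sketch (bin settings are an assumption):
ggplot(penguins_clean, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 30) +
  ggtitle("Histogram of Flipper Length in Palmer Penguins Dataset")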
[Figure: histogram of flipper_length_mm in the cleaned ‘penguins’ dataset (y: count)]
This code creates a histogram to show the distribution of flipper length in the cleaned ‘penguins’ dataset.
A boxplot is created to show the variation in flipper length across different species in the ‘penguins’ dataset.
ggplot(penguins_clean, aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot() +
ggtitle("Boxplot of Flipper Length by Species in Palmer Penguins Dataset")
[Figure: Boxplot of Flipper Length by Species in Palmer Penguins Dataset (Adelie, Chinstrap, Gentoo; y: flipper_length_mm)]
This code creates a boxplot to visualize the flipper length distribution across different species.
10. Conclusion
The analysis provides insights into the Iris and Palmer Penguins datasets through various statistical metrics and visualizations. These results are useful for understanding the characteristics of the datasets and performing further exploratory analysis.
Program - 4
Data Import, Cleaning, and Export with Advanced Data Wrangling
DATE - 08/11/2024
1. Introduction
In this analysis, the steps of loading, cleaning, and analyzing two datasets, the Titanic dataset and the
Adult Income dataset, are followed. The process starts with importing the data, followed by cleaning steps
like handling missing values, removing outliers, and generating summary statistics. The cleaned datasets
are then used to explore correlations between variables.
2. Loading Libraries
library(tidyverse)
library(titanic)
library(dplyr)
library(caret)
library(ggcorrplot)
3. Titanic Dataset
The Titanic dataset is imported using the ‘titanic_train’ dataset from the ‘titanic’ library.
Missing values in the ‘Age’ column are replaced with the median value.
This step replaces all missing values in the ‘Age’ column with the median of the non-missing values, ensuring
no gaps in the data.
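The imputation code is not shown; a sketch, assuming the training data is copied to a data frame named titanic:
titanic <- titanic_train
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)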
Missing Embarked Values
Missing values in the ‘Embarked’ column are replaced with the mode (the most frequent value) of that
column.
Here, the most frequent value in the ‘Embarked’ column is determined and used to fill the missing values.
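A sketch of the mode imputation (in titanic_train missing Embarked values appear as empty strings, so both cases are covered):
mode_embarked <- names(sort(table(titanic$Embarked), decreasing = TRUE))[1]
titanic$Embarked[titanic$Embarked == "" | is.na(titanic$Embarked)] <- mode_embarked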
To remove outliers, the Z-scores are calculated for the numeric columns using the ‘scale()’ function. Any
rows with a Z-score greater than 3 or less than -3 are considered outliers and are removed.
This code identifies rows with Z-scores outside the range of -3 to 3, which are treated as outliers and removed
from the dataset.
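A sketch of the outlier removal, assuming the cleaned result is stored as titanic_clean:
numeric_cols <- sapply(titanic, is.numeric)
z_scores <- scale(titanic[, numeric_cols])
titanic_clean <- titanic[apply(abs(z_scores) <= 3, 1, all), ]  # keep rows whose Z-scores all lie in [-3, 3]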
Before Cleaning
This step provides an overview of the original dataset, including the minimum, maximum, and median values
of each column.
After Cleaning
This step provides an overview of the cleaned dataset, helping to confirm that missing values have been
handled and outliers removed.
Correlation Matrix
The correlation matrix of numeric columns in the cleaned Titanic dataset is calculated. This matrix shows
the relationships between the numeric variables.
The correlation matrix is used to identify any strong linear relationships between the numeric variables.
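The computation presumably mirrors the diamonds correlation code used in Program 6:
cor_matrix <- cor(titanic_clean[, sapply(titanic_clean, is.numeric)], use = "complete.obs")
print(cor_matrix)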
Exporting the Cleaned Data
The cleaned dataset is saved to a CSV file for future analysis or reporting.
This code exports the cleaned dataset to a CSV file, making it available for future use.
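A sketch of the export step (the file name is an assumption):
write.csv(titanic_clean, "titanic_cleaned.csv", row.names = FALSE)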
The correlation matrix is visualized using the ‘ggcorrplot()’ function. The circles represent the correlation
coefficients, with the size of the circle indicating the strength of the correlation.
[Figure: ggcorrplot of the Titanic correlation matrix; variables include PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare]
This plot visually represents the correlation matrix, making it easier to identify relationships between the
variables.
The Adult Income dataset is imported from a local file path, and missing values are handled. The dataset
contains demographic and economic information, with the goal of predicting whether a person earns more
or less than 50K per year.
data <- read.csv("adult.data", header = FALSE)
The column names for the Adult Income dataset are manually specified to better understand the data.
Missing values represented as ‘?’ are replaced with NA for easier handling.
A custom function ‘replace_mode()’ replaces missing values in categorical columns with the mode of that
column.
The ‘replace_mode()’ function replaces missing values in the categorical columns with the most frequent
value (mode).
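Its definition is missing from the extract; presumably along the lines of:
replace_mode <- function(x) {
  mode_value <- names(sort(table(x), decreasing = TRUE))[1]  # most frequent value
  x[is.na(x)] <- mode_value
  x
}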
For numeric columns, missing values are replaced with the median value of that column.
This step ensures that missing values in numeric columns are replaced with the median of the respective
columns.
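A sketch of the numeric median imputation (the data frame is named data, as in the import step above):
numeric_cols <- sapply(data, is.numeric)
data[numeric_cols] <- lapply(data[numeric_cols], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})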
Outliers in the Adult Income dataset are removed using Z-scores, similar to the Titanic dataset.
remove_outliers <- function(x) {
  z_scores <- scale(x)   # center and scale a single numeric vector to Z-scores
  x[abs(z_scores) <= 3]  # keep only values within 3 standard deviations
}
This code detects outliers in the numeric columns and removes any rows with Z-scores outside the range of
-3 to 3.
Before Cleaning
After Cleaning
The correlation matrix is calculated for the numeric columns in the cleaned Adult Income dataset and
visualized using ‘ggcorrplot()’.
[Figure: Correlation Matrix of Adult Income Dataset; variables include age, fnlwgt, education_num, capital_gain, capital_loss, and hours_per_week]
This step visualizes the correlations between the numeric columns and saves the cleaned dataset to a CSV
file for future analysis.
5. Conclusion
The analysis demonstrates how to clean and analyze the Titanic and Adult Income datasets. Missing values
are handled, outliers are removed, and the correlation between variables is visualized. The cleaned datasets
are now ready for further exploration or modeling.
Program - 5
Advanced Data Manipulation with dplyr and Complex Grouping
DATE - 08/11/2024
1. Introduction
In this analysis, the Star Wars dataset and the Flights dataset from the nycflights13 package are
explored. Various operations such as filtering, grouping, and summarizing data are performed, along with
visualizations created using ggplot2. Techniques for joining data frames, calculating rolling averages, and
cumulative sums are also demonstrated.
library(dplyr)
library(nycflights13)
library(ggplot2)
library(zoo)
The starwars dataset contains information about characters in the Star Wars universe. The dataset is
loaded, and the head() function is used to preview the first few rows.
data("starwars")
head(starwars)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
3.2 Filter and Arrange the Data
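The wrangling step is not preserved; from the output below it presumably resembles:
starwars_filtered <- starwars %>%
  select(name, species, height, mass) %>%
  filter(!is.na(height)) %>%
  arrange(desc(height))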
head(starwars_filtered)
## # A tibble: 6 x 4
## name species height mass
## <chr> <chr> <int> <dbl>
## 1 Yarael Poof Quermian 264 NA
## 2 Tarfful Wookiee 234 136
## 3 Lama Su Kaminoan 229 88
## 4 Chewbacca Wookiee 228 112
## 5 Roos Tarpals Gungan 224 82
## 6 Grievous Kaleesh 216 159
[Figure: Height of Star Wars Characters, character heights colored by species (x: Height (cm), y: Character)]
The data is grouped by species, and for each group the average height (avg_height), average mass (avg_mass), and number of characters (count) are calculated, as sketched below. The results are arranged in descending order of count to show the species with the most characters.
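A sketch matching the summary columns shown below:
species_summary <- starwars %>%
  group_by(species) %>%
  summarise(
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(count))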
head(species_summary)
## # A tibble: 6 x 4
## species avg_height avg_mass count
## <chr> <dbl> <dbl> <int>
## 1 Human 178 81.3 35
## 2 Droid 131. 69.8 6
## 3 <NA> 175 81 4
## 4 Gungan 209. 74 3
## 5 Kaminoan 221 88 2
## 6 Mirialan 168 53.1 2
A bar plot is created to show the average height for each species.
A new column, height_category, is created using mutate() to classify characters as “Tall” if their height
is greater than 180 cm, and “Short” otherwise.
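A sketch of the mutate() step (note that characters with missing height fall into an NA category, matching the plot below):
starwars_classified <- starwars %>%
  mutate(height_category = ifelse(height > 180, "Tall", "Short"))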
head(starwars_classified)
## # A tibble: 6 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~
## # i 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, height_category <chr>
A bar plot is created to show the distribution of characters classified into “Tall” and “Short” categories.
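The plotting code is not preserved; a sketch matching the axis labels in the figure:
ggplot(starwars_classified, aes(x = height_category, fill = height_category)) +
  geom_bar() +
  labs(x = "Height Category", y = "Count")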
[Figure: bar chart of character counts by height_category (Short, Tall, NA)]
The flights and airlines datasets are joined on the carrier column using an inner join. This keeps only
rows where there is a matching carrier in both datasets.
data("flights")
data("airlines")
head(flights_inner_join)
## # A tibble: 6 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # i 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>
A full outer join is performed to merge all rows from both datasets, filling in missing values where no match
is found.
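The corresponding call, per the description:
flights_outer_join <- full_join(flights, airlines, by = "carrier")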
head(flights_outer_join)
## # A tibble: 6 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # i 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>
The flights dataset is sorted by year, month, and day. A 5-period rolling average of arr_delay is
calculated using zoo::rollmean(), and the cumulative sum of arr_delay is calculated using cumsum().
Handle Missing Values and Recalculate
Missing values in arr_delay are replaced with 0 before recalculating the rolling average and cumulative
delay.
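A sketch covering both steps (column names are assumptions):
flights_sorted <- flights %>%
  arrange(year, month, day) %>%
  mutate(
    arr_delay = ifelse(is.na(arr_delay), 0, arr_delay),              # replace missing delays with 0
    rolling_avg_delay = zoo::rollmean(arr_delay, k = 5, fill = NA),  # 5-period rolling average
    cumulative_delay = cumsum(arr_delay)
  )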
This plot visualizes the rolling average delay and cumulative delay for flights:
[Figure: rolling average delay and cumulative delay (x1000) by day of the month (x: Day of the Month, y: Delay (minutes))]
5. Conclusion
This analysis explores the starwars and flights datasets. Data manipulation tasks such as filtering,
grouping, and summarizing data are performed, and ggplot2 is used to create meaningful visualizations.
Techniques for joining datasets and calculating rolling averages and cumulative sums are also demonstrated.
Program - 6
Data Visualization with ggplot2 and Customizations
DATE - 15/11/2024
1. Introduction
This analysis focuses on creating advanced visualizations using the ggplot2 package in R. The datasets
include mpg and diamonds, and the visualizations showcase scatter plots, faceted plots, heatmaps, and
annotated graphs. Customizations such as themes, annotations, and color palettes enhance the visual
appeal and interpretability.
The following libraries are loaded for this analysis:
• ggplot2: For data visualization.
• reshape2: For reshaping data to create heatmaps.
• dplyr: For data manipulation.
library(ggplot2)
library(reshape2)
library(dplyr)
This section creates a scatter plot to analyze the relationship between engine displacement (displ) and
highway miles per gallon (hwy). A regression line with confidence intervals is added for further insights.
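The plotting code is not preserved; a sketch matching the figure's title and axes:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", linetype = "dashed", se = TRUE) +
  labs(title = "Scatter Plot of Engine Displacement vs Highway MPG with Regression Line",
       x = "Engine Displacement (L)", y = "Highway Miles per Gallon")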
[Figure: Scatter Plot of Engine Displacement vs Highway MPG with Regression Line (x: Engine Displacement (L), y: Highway Miles per Gallon)]
The scatter plot includes:
• Points representing individual observations.
• A regression line (dashed) showing the trend.
• Confidence intervals around the regression line.
Faceted scatter plots allow data to be visualized separately by vehicle class. This approach highlights
class-specific patterns.
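A sketch of the faceting step:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  labs(title = "Faceted Scatter Plot by Vehicle Class",
       x = "Engine Displacement (L)", y = "Highway Miles per Gallon")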
[Figure: Faceted Scatter Plot by Vehicle Class, one panel per class (2seater, compact, midsize, ..., suv)]
Each facet represents a vehicle class, making it easier to compare relationships within and across classes.
This section visualizes correlations among numeric variables in the diamonds dataset using a heatmap.
data("diamonds")
cor_matrix <- cor(diamonds[, sapply(diamonds, is.numeric)], use = "complete.obs")
cor_data <- melt(cor_matrix)
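The plotting step is missing; a sketch consistent with the legend and color scheme described below (melt() yields Var1, Var2, and value columns):
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1), name = "Correlation") +
  labs(title = "Heatmap of Correlation Matrix for Diamonds Dataset",
       x = "Variables", y = "Variables")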
[Figure: Heatmap of Correlation Matrix for Diamonds Dataset (carat, depth, table, price, x, y, z); legend: Correlation from -1.0 to 1.0]
The heatmap shows:
• Strong correlations in red.
• Weak correlations in white.
• Negative correlations in blue.
This section demonstrates a scatter plot with enhanced aesthetics, including a specific color palette and
additional customization.
[Figure: Customized Scatter Plot with Aesthetic Enhancements, colored by Class (x: Engine Displacement (L), y: Highway Miles per Gallon)]
The customized scatter plot uses:
• A Brewer palette for color differentiation.
• A light theme and bold title for improved readability.
The plot highlights a specific area with a rectangle and adds a label for context. The ggsave function saves
the plot as a PNG file.
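A sketch of the annotation and export (the highlighted coordinates, label text, and file name are all assumptions):
annotated_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate("rect", xmin = 5, xmax = 7, ymin = 10, ymax = 20,
           alpha = 0.2, fill = "red") +                          # highlighted region (coordinates assumed)
  annotate("text", x = 6, y = 22, label = "Large engines, low MPG") +
  labs(x = "Engine Displacement (L)", y = "Highway Miles per Gallon")
ggsave("annotated_plot.png", annotated_plot, width = 7, height = 5)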
8. Conclusion
This analysis showcases advanced visualization techniques using ggplot2. Techniques include scatter plots,
faceted plots, heatmaps, and annotated graphs. Customizations improve interpretability and presentation
quality.
Program - 7
Linear and Multiple Regression Analysis with Interaction Terms
DATE - 22/11/2024
1. Introduction
This analysis uses the Boston dataset to explore the relationships between housing prices and various
predictor variables. The primary goal is to build and evaluate regression models, including simple and
multiple linear regression, to understand the factors influencing median home values (medv). Additionally,
a classification approach is implemented to assess housing value categories using logistic regression.
2. Loading Libraries
library(MASS)
library(ggplot2)
library(caret)
library(car)
library(pROC)
library(dplyr)
library(corrplot)
data("Boston")
The required libraries are loaded to enable data manipulation, visualization, and model building. The
Boston dataset is used for regression analysis, containing information about housing values in Boston.
3. Preprocessing
sum(is.na(Boston))
## [1] 0
summary(Boston)
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This section checks for missing values and provides a statistical summary of the data. A boxplot visualizes
the target variable medv, and potential outliers (homes with medv equal to 50) are removed to improve model
accuracy.
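The filtering step is not shown; dropping the censored medv == 50 homes (consistent with the 490 observations reported later) presumably looks like:
boxplot(Boston$medv, main = "Boxplot of medv")
Boston <- Boston[Boston$medv != 50, ]  # remove censored observations recorded at medv = 50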
4. Feature Selection
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle")
[Figure: corrplot of the Boston correlation matrix (circle method) over all 14 variables]
The correlation matrix between numerical features is calculated and visualized using corrplot. Features
such as lstat (lower status of the population) and rm (average number of rooms per dwelling) are identified
as strong predictors of medv based on their correlations.
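The model call is recoverable from the Call line of the output below:
simple_model <- lm(medv ~ lstat, data = Boston)
summary(simple_model)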
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.992 -3.313 -0.941 1.914 21.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.54041 0.48150 67.58 <2e-16 ***
## lstat -0.84374 0.03268 -25.82 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 5.119 on 488 degrees of freedom
## Multiple R-squared: 0.5774, Adjusted R-squared: 0.5765
## F-statistic: 666.6 on 1 and 488 DF, p-value: < 2.2e-16
A simple linear regression model is fitted with lstat as the predictor for medv. The coefficients and p-values
are interpreted to assess the significance and relationship between the variables.
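The interaction model, again recoverable from the Call line below:
multi_model <- lm(medv ~ lstat * rm, data = Boston)
summary(multi_model)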
##
## Call:
## lm(formula = medv ~ lstat * rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3064 -2.4982 -0.3056 1.8635 18.4779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.99970 3.07045 -8.468 2.98e-16 ***
## lstat 1.97178 0.17761 11.102 < 2e-16 ***
## rm 9.01216 0.46519 19.373 < 2e-16 ***
## lstat:rm -0.43817 0.02976 -14.723 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.845 on 486 degrees of freedom
## Multiple R-squared: 0.7625, Adjusted R-squared: 0.761
## F-statistic: 520 on 3 and 486 DF, p-value: < 2.2e-16
A multiple linear regression model is fitted using lstat, rm, and their interaction term. The adjusted R-squared is analyzed to measure the model’s explanatory power. Interaction terms help understand how rm modifies the effect of lstat on medv.
## AIC: 2716.448
cat("BIC:", BIC_value, "\n")
## BIC: 2737.42
Model performance metrics such as adjusted R-squared, AIC, and BIC are calculated and printed. These
metrics help compare models by evaluating their goodness of fit and complexity.
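The computations implied above (BIC_value appears in the surviving snippet) are presumably:
adj_r_squared <- summary(multi_model)$adj.r.squared
AIC_value <- AIC(multi_model)
BIC_value <- BIC(multi_model)
cat("Adjusted R-squared:", adj_r_squared, "\n")
cat("AIC:", AIC_value, "\n")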
[Figure: residuals vs. fitted values for lm(medv ~ lstat * rm); observations 355, 178, and 354 flagged]
[Figure: Normal Q-Q plot of standardized residuals for lm(medv ~ lstat * rm)]
Residual diagnostics are performed to check assumptions of linear regression. The residuals vs. fitted plot
evaluates homoscedasticity and linearity, while the Q-Q plot assesses the normality of residuals.
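The diagnostic calls presumably use the built-in plot method for lm objects:
plot(multi_model, which = 1)  # residuals vs. fitted values
plot(multi_model, which = 2)  # normal Q-Q plot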
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(medv ~ lstat * rm, data = Boston, method = "lm",
trControl = train_control)
print(cv_model)
## Linear Regression
##
## 490 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 441, 441, 442, 441, 441, 440, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.855714 0.7597212 2.868657
##
## Tuning parameter ’intercept’ was held constant at a value of TRUE
Ten-fold cross-validation is used to evaluate the predictive performance of the multiple linear regression
model. The cross-validated RMSE provides an estimate of the model’s accuracy on unseen data.
10. ROC Curve Analysis (Classification Approach)
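The classification setup is recoverable from the Call line of the output and the description at the end of this section; a sketch:
Boston$medv_class <- ifelse(Boston$medv >= 25, 1, 0)  # 1 = high-value home
logit_model <- glm(medv_class ~ lstat * rm, family = "binomial", data = Boston)
summary(logit_model)
roc_curve <- roc(Boston$medv_class, predict(logit_model, type = "response"))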
##
## Call:
## glm(formula = medv_class ~ lstat * rm, family = "binomial", data = Boston)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.80238 7.45022 -2.658 0.00786 **
## lstat -0.02301 0.75334 -0.031 0.97563
## rm 3.29921 1.12727 2.927 0.00343 **
## lstat:rm -0.03989 0.11492 -0.347 0.72851
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 536.34 on 489 degrees of freedom
## Residual deviance: 254.70 on 486 degrees of freedom
## AIC: 262.7
##
## Number of Fisher Scoring iterations: 8
plot(roc_curve, main = "ROC Curve for Logistic Regression Model", col = "blue")
abline(a = 0, b = 1, lty = 2, col = "red")
[Figure: ROC Curve for Logistic Regression Model (y: Sensitivity), with the diagonal reference line]
## AUC: 0.9392864
The regression problem is converted into a binary classification task by creating a new variable medv_class,
where values of medv greater than or equal to 25 are classified as 1 and others as 0. A logistic regression model
is fitted with interaction terms. The ROC curve and AUC are used to evaluate the model’s classification
performance.
11. Conclusion
The analysis demonstrates that a multiple linear regression model incorporating interaction terms between
lstat and rm provides a better fit compared to a simple linear regression model. The model’s performance
is assessed through adjusted R-squared, AIC, and BIC values, and its predictive accuracy is evaluated using
cross-validation. The assumptions of linear regression are reasonably satisfied based on residual diagnostics.
Additionally, a logistic regression model effectively classifies homes based on their median value, with the
ROC curve indicating satisfactory discriminatory power.
Program - 8
K-Means Clustering and PCA for Dimensionality Reduction
DATE - 22/11/2024
1. Introduction
The analysis involves performing clustering on two datasets: the Wine dataset and the Breast Cancer
dataset. Principal Component Analysis (PCA) is used for dimensionality reduction, followed by k-means
clustering for group identification. Methods such as the elbow method and silhouette analysis are employed
to determine the optimal number of clusters. The objective is to explore patterns in the datasets and
visualize clustering results.
library(ggplot2)
library(cluster)
library(factoextra)
The required libraries are loaded, and the Wine dataset is prepared by normalizing its features to scale
values between 0 and 1.
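The loading and normalization code is not preserved; a sketch, assuming the UCI Wine measurements are in a data frame wine with the class label removed:
normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # min-max scaling to [0, 1]
wine_norm <- as.data.frame(lapply(wine, normalize))
wine_pca <- prcomp(wine_norm, scale. = TRUE)
summary(wine_pca)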
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
wine_pca_data <- as.data.frame(wine_pca$x[, 1:2])
PCA is applied to the normalized dataset, and the first two principal components are extracted for clustering.
[Figure: elbow plot, total within sum of squares vs. number of clusters k (1-10)]
[Figure: average silhouette width vs. number of clusters k (1-10), marking the optimal k]
The elbow method and silhouette analysis are utilized to identify the optimal number of clusters for k-means
clustering.
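Given that factoextra is loaded, the diagnostics above were presumably produced with:
fviz_nbclust(wine_pca_data, kmeans, method = "wss")         # elbow method
fviz_nbclust(wine_pca_data, kmeans, method = "silhouette")  # average silhouette width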
set.seed(123)
wine_kmeans <- kmeans(wine_pca_data, centers = 3, nstart = 25)
[Figure: K-Means Clustering on Wine Dataset, scatter of PC1 vs. PC2 colored by cluster (1-3)]
## Cluster Sizes: 64 65 49
K-means clustering is performed with the optimal number of clusters, and the results are visualized using a
scatter plot. Cluster sizes are reported.
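A sketch of the visualization and the size report that produced the output above:
ggplot(wine_pca_data, aes(x = PC1, y = PC2, color = factor(wine_kmeans$cluster))) +
  geom_point() +
  labs(title = "K-Means Clustering on Wine Dataset", color = "cluster")
cat("Cluster Sizes:", wine_kmeans$size, "\n")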
The Breast Cancer dataset is loaded, and features are normalized to scale values between 0 and 1.
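A sketch of the normalization, assuming the 30 WDBC feature columns are in a data frame bc and reusing the normalize() helper sketched above:
bc_norm <- as.data.frame(lapply(bc, normalize))  # bc: 30 numeric WDBC feature columns (assumed)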
bc_pca <- prcomp(bc_norm, scale. = TRUE)
summary(bc_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion 0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion 0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
## PC29 PC30
## Standard deviation 0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000
PCA is applied to the normalized dataset, and the first two principal components are extracted for clustering.
[Figure: elbow plot for the Breast Cancer data, total within sum of squares vs. number of clusters k (1-10)]
[Figure: average silhouette width vs. number of clusters k (1-10) for the Breast Cancer data]
The elbow method and silhouette analysis are used to determine the optimal number of clusters for k-means clustering.
set.seed(123)
bc_kmeans <- kmeans(bc_pca_data, centers = 2, nstart = 25)
[Figure: k-means clustering on the Breast Cancer dataset, scatter of PC1 vs. PC2 colored by cluster (1-2)]
K-means clustering is performed with the optimal number of clusters. Results are visualized using a scatter
plot, and cluster sizes are reported.
4. Conclusion
The analysis successfully demonstrated the use of PCA for dimensionality reduction and k-means clustering
for grouping data points. For the Wine dataset, three clusters were identified, while the Breast Cancer
dataset exhibited two clusters. The elbow and silhouette methods guided the determination of the optimal
cluster numbers, and visualizations provided insights into the clustering patterns. These techniques showcase
how dimensionality reduction and clustering can reveal structure in complex datasets.
Program - 9
Time Series Analysis using ARIMA and Seasonal Decomposition
DATE - 20/12/2024
1. Introduction
This document demonstrates how to perform time series analysis using two datasets: AirPassengers (monthly international airline passenger counts from 1949 to 1960) and Monthly Milk Production (monthly milk production per cow; the series printed later in this section runs from 1994 to 2005). The analysis focuses on the following key aspects:
1. Exploratory Data Analysis (EDA): Understanding the structure and properties of the datasets.
2. Decomposition: Breaking down the series into its components—trend, seasonality, and residuals.
3. Model Fitting: Building ARIMA and SARIMA models to predict future values.
The goal is to compare ARIMA and SARIMA models, assess their performance, and understand their
suitability for time series forecasting.
2. Loading Libraries
This section imports essential libraries for time series modeling and visualization.
• tseries: Includes functions for hypothesis testing, such as the Augmented Dickey-Fuller (ADF) test.
• forecast: Supplies auto.arima() and forecast(), used below for model fitting and prediction.
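The library calls themselves are not shown; given the functions used later, presumably:
library(tseries)
library(forecast)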
Utility functions simplify repetitive tasks such as analyzing the data, decomposing time series, building
models, and comparing forecasts.
3.1 Exploratory Data Analysis (EDA)
This function performs a detailed EDA, including statistical summaries and visualizations. Key components:
• Summary Statistics: Provides measures like mean, median, and range, giving a quick overview of
the dataset’s properties.
• ACF and PACF Plots: Help identify dependencies in the series, aiding in ARIMA model parameter
selection.
# Function to perform Exploratory Data Analysis (EDA) on the time series data
perform_eda <- function(ts_data, dataset_name) {
cat("Exploratory Data Analysis for ", dataset_name, "\n")
print(summary(ts_data))
plot(ts_data, main = paste(dataset_name, "Time Series"), ylab = "Values", xlab = "Time")
cat("ACF and PACF plots:\n")
acf(ts_data, main = paste("ACF of", dataset_name))
pacf(ts_data, main = paste("PACF of", dataset_name))
}
Decomposition is crucial for understanding how different components contribute to the series.
This function splits the time series into these components and visualizes them for further analysis.
# Function to decompose the time series into trend, seasonal, and residual components
decompose_ts <- function(ts_data, dataset_name) {
cat("Decomposing the time series for ", dataset_name, "\n")
decomposition <- decompose(ts_data)
plot(decomposition)
return(decomposition)
}
# Function to fit an ARIMA model to the time series data
fit_arima <- function(ts_data, dataset_name) {
cat("Fitting ARIMA model for ", dataset_name, "\n")
adf_test <- adf.test(ts_data, alternative = "stationary")
cat("ADF Test p-value:", adf_test$p.value, "\n")
if (adf_test$p.value > 0.05) {
ts_data <- diff(ts_data)
plot(ts_data, main = paste(dataset_name, "Differenced Time Series"))
}
auto_model <- auto.arima(ts_data, seasonal = FALSE)
print(summary(auto_model))
forecast_result <- forecast(auto_model, h = 12)
plot(forecast_result, main = paste(dataset_name, "ARIMA Forecast"))
return(auto_model)
}
SARIMA extends ARIMA by incorporating seasonality. It uses seasonal differencing to handle periodic
patterns.
This function automatically selects SARIMA parameters (P, D, Q) and fits the model to the data.
# Function to fit a Seasonal ARIMA (SARIMA) model to the time series data
fit_sarima <- function(ts_data, dataset_name) {
cat("Fitting SARIMA model for ", dataset_name, "\n")
auto_sarima <- auto.arima(ts_data, seasonal = TRUE)
print(summary(auto_sarima))
sarima_forecast <- forecast(auto_sarima, h = 12)
plot(sarima_forecast, main = paste(dataset_name, "SARIMA Forecast"))
return(auto_sarima)
}
This function calculates and compares the accuracy of ARIMA and SARIMA forecasts using metrics like
Root Mean Squared Error (RMSE).
This function creates a visualization comparing actual vs. forecasted values for ARIMA and SARIMA models.
The model with lower RMSE is highlighted with a better color.
# Function to visualize the comparison of ARIMA and SARIMA forecast performance
plot_forecast_comparison <- function(actual_values, arima_forecast, sarima_forecast,
time_points) {
arima_rmse <- sqrt(mean((arima_forecast - actual_values)^2))
sarima_rmse <- sqrt(mean((sarima_forecast - actual_values)^2))
better_color <- ifelse(arima_rmse < sarima_rmse, "green", "red")
worse_color <- ifelse(arima_rmse < sarima_rmse, "red", "green")
plot(time_points, actual_values, type = "o", col = "blue", pch = 16, lty = 1,
xlab = "Time", ylab = "Values", main = "Forecast Comparison")
lines(time_points, arima_forecast, col = better_color, lty = 2, lwd = 2)
lines(time_points, sarima_forecast, col = worse_color, lty = 3, lwd = 2)
legend("topright", legend = c("Actual Values", paste("ARIMA (RMSE =",
round(arima_rmse, 2), ")"), paste("SARIMA (RMSE =",
round(sarima_rmse, 2), ")")), col = c("blue", better_color,
worse_color), lty = c(1, 2, 3), lwd = c(1, 2, 2),
pch = c(16, NA, NA))
}
4. Dataset Analysis
The AirPassengers dataset records monthly international airline passenger numbers from 1949 to 1960.
Below is the analysis conducted on this dataset:
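The EDA call is missing from the extract; since air_data is used later, it was presumably:
air_data <- AirPassengers
perform_eda(air_data, "AirPassengers")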
[Figure: AirPassengers Time Series (x: Time, y: Values)]
[Figure: ACF of AirPassengers]
[Figure: PACF of AirPassengers]
decompose_ts(air_data, "AirPassengers")
[Figure: decomposition of additive time series: observed, trend, seasonal, random (x: Time)]
## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1949 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1950 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1951 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1952 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1953 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1954 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1955 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1956 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1957 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1958 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1959 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1960 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## Jul Aug Sep Oct Nov Dec
## 1949 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1950 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1951 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1952 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1953 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1954 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1955 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1956 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1957 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1958 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1959 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1960 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
##
## $trend
## Jan Feb Mar Apr May Jun Jul Aug
## 1949 NA NA NA NA NA NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
## 1952 183.1250 186.2083 189.0417 191.2917 193.5833 195.8333 198.0417 199.7500
## 1953 215.8333 218.5000 220.9167 222.9167 224.0833 224.7083 225.3333 225.3333
## 1954 228.0000 230.4583 232.2500 233.9167 235.6250 237.7500 240.5000 243.9583
## 1955 261.8333 266.6667 271.1250 275.2083 278.5000 281.9583 285.7500 289.3333
## 1956 309.9583 314.4167 318.6250 321.7500 324.5000 327.0833 329.5417 331.8333
## 1957 348.2500 353.0000 357.6250 361.3750 364.5000 367.1667 369.4583 371.2083
## 1958 375.2500 377.9167 379.5000 380.0000 380.7083 380.9583 381.8333 383.6667
## 1959 402.5417 407.1667 411.8750 416.3333 420.5000 425.5000 430.7083 435.1250
## 1960 456.3333 461.3750 465.2083 469.3333 472.7500 475.0417 NA NA
## Sep Oct Nov Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
## 1952 202.2083 206.2500 210.4167 213.3750
## 1953 224.9583 224.5833 224.4583 225.5417
## 1954 247.1667 250.2500 253.5000 257.1250
## 1955 293.2500 297.1667 301.0000 305.4583
## 1956 334.4583 337.5417 340.5417 344.0833
## 1957 372.1667 372.4167 372.7500 373.6250
## 1958 386.5000 390.3333 394.7083 398.6250
## 1959 437.7083 440.9583 445.8333 450.6250
## 1960 NA NA NA NA
##
## $random
## Jan Feb Mar Apr May Jun
## 1949 NA NA NA NA NA NA
## 1950 8.4987374 29.1047980 8.3244949 6.6199495 -7.9103535 -25.1527778
## 1951 12.6237374 26.6464646 18.4078283 6.9116162 9.8396465 -26.4861111
## 1952 12.6237374 29.9797980 6.1994949 -2.2550505 -6.0770202 -13.2361111
## 1953 4.9154040 13.6881313 17.3244949 20.1199495 9.4229798 -17.1111111
## 1954 0.7487374 -6.2702020 4.9911616 1.1199495 2.8813131 -9.1527778
## 1955 4.9154040 2.5214646 -1.8838384 1.8282828 -3.9936869 -2.3611111
## 1956 -1.2095960 -1.2285354 0.6161616 -0.7133838 -1.9936869 11.5138889
## 1957 -8.5012626 -15.8118687 0.6161616 -5.3383838 -4.9936869 19.4305556
## 1958 -10.5012626 -23.7285354 -15.2588384 -23.9633838 -13.2020202 18.6388889
## 1959 -17.7929293 -28.9785354 -3.6338384 -12.2967172 4.0063131 11.0972222
## 1960 -14.5845960 -34.1868687 -43.9671717 -0.2967172 3.7563131 24.5555556
## Jul Aug Sep Oct Nov Dec
## 1949 -42.6224747 -42.0732323 -8.4785354 11.0593434 28.5934343 16.8699495
## 1950 -34.7474747 -35.9898990 -4.2285354 5.2260101 16.0517677 13.9116162
## 1951 -36.0808081 -37.4065657 -7.9785354 5.8093434 21.5517677 14.4532828
## 1952 -31.8724747 -20.5732323 -9.7285354 5.3926768 15.1767677 9.2449495
## 1953 -25.1641414 -16.1565657 -4.4785354 7.0593434 9.1351010 4.0782828
## 1954 -2.3308081 -13.7815657 -4.6868687 -0.6073232 3.0934343 0.4949495
## 1955 14.4191919 -5.1565657 2.2297980 -2.5239899 -10.4065657 1.1616162
## 1956 19.6275253 10.3434343 4.0214646 -10.8989899 -15.9482323 -9.4633838
## 1957 31.7108586 32.9684343 15.3131313 -4.7739899 -14.1565657 -9.0050505
## 1958 45.3358586 58.5101010 0.9797980 -10.6906566 -31.1148990 -33.0050505
## 1959 53.4608586 61.0517677 8.7714646 -13.3156566 -30.2398990 -17.0050505
## 1960 NA NA NA NA NA NA
##
## $figure
## [1] -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## [7] 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
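The calls producing the model output below are lost in the extract; given the later use of arima_air and sarima_air, they were presumably:
arima_air <- fit_arima(air_data, "AirPassengers")
sarima_air <- fit_sarima(air_data, "AirPassengers")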
## ar1 ar2 ar3 ar4 ma1 ma2 drift
## 0.2243 0.3689 -0.2567 -0.2391 -0.0971 -0.8519 2.6809
## s.e. 0.1047 0.1147 0.0985 0.0919 0.0866 0.0877 0.1711
##
## sigma^2 = 706.3: log likelihood = -670.07
## AIC=1356.15 AICc=1357.22 BIC=1379.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.228696 25.82793 20.59211 -1.665245 7.476447 0.6428946
## ACF1
## Training set 0.0009861078
[Figure: AirPassengers SARIMA Forecast (values roughly 100-700)]
The forecasting comparison evaluates both ARIMA and SARIMA models by comparing their forecasts to
actual values.
h_air <- 12
air_actual_values <- air_data[(length(air_data) - h_air + 1):length(air_data)]
arima_air_forecast <- forecast(arima_air, h = h_air)$mean
sarima_air_forecast <- forecast(sarima_air, h = h_air)$mean
time_points_air <- time(air_data)[(length(air_data) - h_air + 1):length(air_data)]
plot_forecast_comparison(air_actual_values, arima_air_forecast, sarima_air_forecast,
time_points_air)
[Figure: Forecast Comparison of actual values vs. ARIMA and SARIMA forecasts for AirPassengers (x: Time, y: Values)]
The Monthly Milk Production dataset contains monthly records of milk production per cow; the series printed below runs from 1994 to 2005. Below is the analysis conducted on this dataset:
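The loading and EDA calls are not preserved; a sketch, assuming the series arrives in a data frame milk_df with a production column (both names hypothetical):
milk_data <- ts(milk_df$production, start = c(1994, 1), frequency = 12)  # milk_df is a hypothetical source data frame
perform_eda(milk_data, "Monthly Milk Production")
decompose_ts(milk_data, "Monthly Milk Production")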
[Figure: Monthly Milk Production Time Series (x: Time, y: Values)]
[Figure: ACF of Monthly Milk Production]
[Figure: PACF of Monthly Milk Production]
[Figure: decomposition of the Monthly Milk Production series (x: Time)]
## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1994 1343 1236 1401 1396 1457 1388 1389 1369 1318 1354 1312 1370
## 1995 1404 1295 1453 1427 1484 1421 1414 1375 1331 1364 1320 1380
## 1996 1415 1348 1469 1441 1479 1398 1400 1382 1342 1391 1350 1418
## 1997 1433 1328 1500 1474 1529 1471 1473 1446 1377 1416 1369 1438
## 1998 1466 1347 1515 1501 1556 1477 1468 1443 1386 1446 1407 1489
## 1999 1518 1404 1585 1554 1610 1516 1498 1487 1445 1491 1459 1538
## 2000 1579 1506 1632 1593 1636 1547 1561 1525 1464 1511 1459 1519
## 2001 1549 1431 1599 1571 1632 1555 1552 1520 1472 1522 1485 1549
## 2002 1591 1472 1654 1621 1678 1587 1578 1570 1497 1539 1496 1575
## 2003 1615 1489 1666 1627 1671 1596 1597 1571 1511 1561 1517 1596
## 2004 1624 1531 1661 1636 1692 1607 1623 1601 1533 1583 1531 1610
## 2005 1643 1522 1707 1690 1760 1690 1683 1671 1599 1637 1592 1663
##
## $seasonal
## Jan Feb Mar Apr May Jun Jul
## 1994 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1995 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1996 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1997 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1998 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 1999 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2000 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2001 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2002 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2003 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2004 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## 2005 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## Aug Sep Oct Nov Dec
## 1994 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1995 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1996 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1997 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1998 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 1999 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2000 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2001 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2002 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2003 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2004 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
## 2005 -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
##
## $trend
## Jan Feb Mar Apr May Jun Jul Aug
## 1994 NA NA NA NA NA NA 1363.625 1368.625
## 1995 1384.042 1385.333 1386.125 1387.083 1387.833 1388.583 1389.458 1392.125
## 1996 1393.917 1393.625 1394.375 1395.958 1398.333 1401.167 1403.500 1403.417
## 1997 1421.208 1426.917 1431.042 1433.542 1435.375 1437.000 1439.208 1441.375
## 1998 1448.208 1447.875 1448.125 1449.750 1452.583 1456.292 1460.583 1465.125
## 1999 1486.750 1489.833 1494.125 1498.458 1502.500 1506.708 1511.292 1518.083
## 2000 1536.875 1541.083 1543.458 1545.083 1545.917 1545.125 1543.083 1538.708
## 2001 1530.958 1530.375 1530.500 1531.292 1532.833 1535.167 1538.167 1541.625
## 2002 1559.667 1562.833 1565.958 1567.708 1568.875 1570.417 1572.500 1574.208
## 2003 1577.375 1578.208 1578.833 1580.333 1582.125 1583.875 1585.125 1587.250
## 2004 1593.083 1595.417 1597.583 1599.417 1600.917 1602.083 1603.458 1603.875
## 2005 1626.917 1632.333 1638.000 1643.000 1647.792 1652.542 NA NA
## Sep Oct Nov Dec
## 1994 1373.250 1376.708 1379.125 1381.625
## 1995 1395.000 1396.250 1396.625 1395.458
## 1996 1403.875 1406.542 1410.000 1415.125
## 1997 1442.792 1444.542 1446.792 1448.167
## 1998 1470.417 1475.542 1480.000 1483.875
## 1999 1524.292 1527.875 1530.583 1532.958
## 2000 1534.208 1531.917 1530.833 1531.000
## 2001 1545.625 1550.000 1554.000 1557.250
## 2002 1575.417 1576.167 1576.125 1576.208
## 2003 1588.792 1588.958 1590.208 1591.542
## 2004 1605.417 1609.583 1614.667 1620.958
## 2005 NA NA NA NA
##
## $random
## Jan Feb Mar Apr May
## 1994 NA NA NA NA NA
## 1995 -5.21085859 -7.42676768 -8.73737374 -5.74116162 -1.17676768
## 1996 -4.08585859 37.28156566 -0.98737374 -0.61616162 -16.67676768
## 1997 -13.37752525 -16.01010101 -6.65404040 -5.19949495 -3.71843434
## 1998 -7.37752525 -17.96843434 -8.73737374 5.59217172 6.07323232
## 1999 6.08080808 -2.92676768 15.26262626 9.88383838 10.15656566
## 2000 16.95580808 47.82323232 12.92929293 2.25883838 -7.26010101
## 2001 -7.12752525 -16.46843434 -7.11237374 -5.94949495 1.82323232
## 2002 6.16414141 -7.92676768 12.42929293 7.63383838 11.78156566
## 2003 12.45580808 -6.30176768 11.55429293 1.00883838 -8.46843434
## 2004 5.74747475 18.48989899 -12.19570707 -9.07449495 -6.26010101
## 2005 -9.08585859 -27.42676768 -6.61237374 1.34217172 14.86489899
## Jun Jul Aug Sep Oct
## 1994 NA 12.47853535 13.69823232 16.04292929 5.22095960
## 1995 15.60732323 11.64520202 -3.80176768 7.29292929 -4.32070707
## 1996 -19.97601010 -16.39646465 -8.09343434 9.41792929 12.38762626
## 1997 17.19065657 20.89520202 17.94823232 5.50126263 -0.61237374
## 1998 3.89898990 -5.47979798 -8.80176768 -13.12373737 -1.61237374
## 1999 -7.51767677 -26.18813131 -17.76010101 -7.99873737 -8.94570707
## 2000 -14.93434343 5.02020202 -0.38510101 1.08459596 7.01262626
## 2001 3.02398990 0.93686869 -8.30176768 -2.33207071 -0.07070707
## 2002 -0.22601010 -7.39646465 9.11489899 -7.12373737 -9.23737374
## 2003 -4.68434343 -1.02146465 -2.92676768 -6.49873737 -0.02904040
## 2004 -11.89267677 6.64520202 10.44823232 -1.12373737 1.34595960
## 2005 20.64898990 NA NA NA NA
## Nov Dec
## 1994 6.06565657 -6.77904040
## 1995 -3.43434343 -10.61237374
## 1996 13.19065657 7.72095960
## 1997 -4.60101010 -5.32070707
## 1998 0.19065657 9.97095960
## 1999 1.60732323 9.88762626
## 2000 1.35732323 -7.15404040
## 2001 4.19065657 -3.40404040
## 2002 -6.93434343 3.63762626
## 2003 -0.01767677 9.30429293
## 2004 -10.47601010 -6.11237374
## 2005 NA NA
##
## $figure
## [1] 25.16919 -82.90657 75.61237 45.65783 97.34343 16.80934 12.89646
## [8] -13.32323 -71.29293 -27.92929 -73.19066 -4.84596
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
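The seasonal model fitted to the milk series is summarized next; the printout below is assumed to come from a call such as:
summary(sarima_milk)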
##
## Coefficients:
## ar1 sar1 sar2 sma1 sma2 drift
## 0.8638 0.0607 -0.4074 -1.0121 0.4831 2.1882
## s.e. 0.0475 0.1862 0.1173 0.1994 0.1881 0.2174
##
## sigma^2 = 137.9: log likelihood = -518.84
## AIC=1051.67 AICc=1052.57 BIC=1071.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1211196 10.98512 8.342375 -0.01115387 0.5520753 0.2685838
## ACF1
## Training set -0.08850265
The same forecast comparison is repeated for the milk production series, again evaluating the ARIMA and SARIMA forecasts against the actual values.
# Hold out the last 12 months as the comparison window
h_milk <- 12
milk_actual_values <- milk_data[(length(milk_data) - h_milk + 1):length(milk_data)]
# 12-step-ahead forecasts from each fitted model
arima_milk_forecast <- forecast(arima_milk, h = h_milk)$mean
sarima_milk_forecast <- forecast(sarima_milk, h = h_milk)$mean
# Time-axis values for the comparison window
time_points_milk <- time(milk_data)[(length(milk_data) - h_milk + 1):length(milk_data)]
plot_forecast_comparison(milk_actual_values, arima_milk_forecast, sarima_milk_forecast,
                         time_points_milk)
[Figure: Forecast Comparison for the milk series — actual values vs ARIMA (RMSE = 64.74) and SARIMA (RMSE = 28.77) forecasts]
5. Conclusion
This document presented a comprehensive analysis of the AirPassengers and Monthly Milk Production
datasets using ARIMA and SARIMA models. The analysis included exploratory data visualization,
decomposition, model fitting, and forecasting. Key insights are summarized as follows:
• ARIMA Models: Effective for capturing non-seasonal trends and patterns. However, they might
underperform for datasets with strong seasonal components.
• SARIMA Models: Demonstrated superior performance in handling seasonal variations, especially
evident in datasets like AirPassengers and Monthly Milk Production.
• Model Comparison: The comparison of forecasting accuracy metrics such as RMSE highlighted the
relative strengths of ARIMA and SARIMA models for each dataset.
The results underscore the importance of selecting models that align with the inherent characteristics of the
time series data. While ARIMA offers simplicity and robustness for non-seasonal data, SARIMA excels in
datasets where seasonality plays a significant role. Future work could explore hybrid approaches or advanced
techniques like machine learning to enhance forecasting accuracy.
Program - 10
DATE - 20/12/2024
1. Introduction
This report explores the ‘gapminder’ dataset and creates interactive visualizations using the ‘plotly’ library.
The dataset provides valuable insights into global trends in life expectancy, GDP per capita, and population
across countries and continents over time. By leveraging interactive plots, the analysis becomes more
engaging and accessible for users. Finally, these visualizations are integrated into an interactive dashboard
for a comprehensive exploration experience.
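The packages assumed by the code in this program are attached first (a minimal setup sketch; the original report may load them elsewhere):
library(gapminder) # the gapminder dataset
library(dplyr)     # filtering and arranging
library(ggplot2)   # static charts
library(plotly)    # interactive charts, subplot(), and layout()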
data("gapminder")
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
5. Scatter Plot: GDP vs Life Expectancy by Continent
This visualization shows the relationship between GDP per capita and life expectancy, with the data grouped
by continent. It helps uncover how wealth (GDP per capita) correlates with health outcomes (life expectancy)
across regions. The size of the points represents the population of each country. Hovering over the points
reveals additional details, such as the country name and its GDP per capita.
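The construction of scatter_plot is not shown on this page; a minimal plotly sketch consistent with the description above might look like the following (the year-2007 snapshot, log GDP axis, and hover text are assumptions):
# Hypothetical construction of scatter_plot (sketch, not the original code)
scatter_plot <- gapminder %>%
  filter(year == 2007) %>% # assumed snapshot year
  plot_ly(x = ~gdpPercap, y = ~lifeExp,
          color = ~continent, size = ~pop, sizes = c(10, 60),
          type = "scatter", mode = "markers",
          text = ~paste(country, "<br>GDP per capita:", round(gdpPercap)),
          hoverinfo = "text") %>%
  layout(xaxis = list(title = "GDP per Capita", type = "log"),
         yaxis = list(title = "Life Expectancy"))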
scatter_plot
[Figure: Interactive scatter plot of GDP per capita vs life expectancy — point size encodes population, color encodes continent (Africa, Americas, Asia, Europe, Oceania)]
6. Bar Chart: Life Expectancy by Country in 2007
This bar chart highlights the life expectancy of each country for the year 2007. It allows for quick identification of the countries with the highest and lowest life expectancies in the dataset. The bars are labeled with the country name and its life expectancy, making it easy to compare values across countries.
# Filter for year 2007 and create a bar chart of life expectancy by country
bar_chart <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp)) %>%
  ggplot(aes(x = reorder(country, lifeExp), y = lifeExp, fill = continent)) +
  geom_bar(stat = "identity") +
  coord_flip() + # Flip for better readability
  labs(title = "Life Expectancy by Country in 2007",
       x = "Country",
       y = "Life Expectancy") +
  theme_minimal()
bar_chart
[Figure: Life Expectancy by Country in 2007 — horizontal bars sorted by life expectancy (0–80 years), colored by continent]
7. Line Chart: Life Expectancy Trend in Asia
This line chart visualizes the trend in life expectancy over time for countries in Asia. It provides a detailed view of how life expectancy has evolved across different Asian countries, highlighting variations and patterns. Each line represents one country, enabling a comparative view of how life expectancy has changed from year to year.
# Filter data for Asia and create a line chart showing life expectancy trends over time
line_chart <- gapminder %>%
  filter(continent == "Asia") %>%
  ggplot(aes(x = year, y = lifeExp, color = country, group = country)) +
  geom_line() +
  labs(title = "Life Expectancy Trend in Asia",
       x = "Year",
       y = "Life Expectancy") +
  theme_minimal()
line_chart
[Figure: Life Expectancy Trend in Asia — one line per country, colored by country]
8. Interactive Dashboard
To provide a holistic view, the scatter plot, bar chart, and line chart are integrated into an interactive dashboard. This layout allows users to explore multiple aspects of the data simultaneously and combines the key insights from the individual plots into a single interface, making it easier to draw meaningful conclusions.
# Combine the scatter, bar, and line charts into one interactive layout
dashboard <- subplot(scatter_plot, bar_chart, line_chart, nrows = 1) %>%
  layout(title = 'Gapminder Data Visualization')
dashboard
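Note that subplot() accepts both plotly and ggplot2 objects, converting the latter internally via ggplotly(), so printing dashboard renders all three charts as a single interactive figure.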
[Figure: Gapminder Data Visualization — the combined interactive dashboard showing the scatter, bar, and line charts side by side]
9. Conclusion
The visualizations above provide insights into global trends in life expectancy, GDP per capita, and population. By combining interactive elements, users are empowered to explore patterns, relationships, and
disparities in the data. Using the ‘plotly’ library, dynamic and visually engaging visualizations have been
created, enhancing data exploration and analysis.