Global COVID-19 Testing Analysis Using R

Yuni Astari

2025-04-15

Introduction

The COVID-19 pandemic has significantly impacted countries worldwide, influencing public health systems,
economies, and daily life. One of the most important strategies in managing the spread of the virus has
been large-scale testing. Accurate and accessible testing data is essential to understanding the scope of the
outbreak and implementing timely interventions.
In this project, I take on the role of a data analyst for a news channel’s data science team. The team is
preparing a feature story on global COVID-19 testing efforts, and I have been assigned to gather and analyze
real-world testing data to support the story with data-driven insights.

#install.packages("httr")
#install.packages("rvest")
library(httr)

## Warning: package ’httr’ was built under R version 4.4.3

library(rvest)

## Warning: package ’rvest’ was built under R version 4.4.3

Task 1: Get a COVID-19 pandemic Wiki page using an HTTP request

get_wiki_covid19_page <- function() {
  wiki_url <- "https://en.wikipedia.org/w/index.php"
  response <- GET(wiki_url, query = list(title = "Template:COVID-19_testing_by_country"))
  return(response)
}

get_wiki_covid19_page()

## Response [https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
## Date: 2025-04-15 04:15
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 452 kB
## <!DOCTYPE html>
## <html class="client-nojs vector-feature-language-in-header-enabled vector-fea...

## <head>
## <meta charset="UTF-8">
## <title>Template:COVID-19 testing by country - Wikipedia</title>
## <script>(function(){var className="client-js vector-feature-language-in-heade...
## RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.st...
## <script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return...
## }];});});</script>
## <link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%...
## ...

Task 2: Extract COVID-19 testing data table from the wiki HTML page

wiki_extr <- read_html(get_wiki_covid19_page())


wiki_extr

## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-pa
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

table <- html_nodes(wiki_extr, "table")


table

## {xml_nodeset (4)}
## [1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
## [2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
## [3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
## [4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...

data_covid <- as.data.frame(html_table(table[2]))


head(data_covid)

## Country.or.region Date.a. Tested Units.b. Confirmed.cases.
## 1 Afghanistan 17 Dec 2020 154,767 samples 49,621
## 2 Albania 18 Feb 2021 428,654 samples 96,838
## 3 Algeria 2 Nov 2020 230,553 samples 58,574
## 4 Andorra 23 Feb 2022 300,307 samples 37,958
## 5 Angola 2 Feb 2021 399,228 samples 20,981
## 6 Antigua and Barbuda 6 Mar 2021 15,268 samples 832
## Confirmed..tested.. Tested..population.. Confirmed..population.. Ref.
## 1 32.1 0.40 0.13 [1]
## 2 22.6 15.0 3.4 [2]
## 3 25.4 0.53 0.13 [3][4]
## 4 12.6 387 49.0 [5]
## 5 5.3 1.3 0.067 [6]
## 6 5.4 15.9 0.86 [7]
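Picking the table by position (`table[2]`) depends on the page layout staying the same. As a hedged alternative, the sketch below selects a table by its CSS class on a small inline HTML page, so it runs offline. The toy page is made up for illustration (only the two data rows reuse values from the output above), and it assumes rvest 1.0 or later, where `html_element()` and `minimal_html()` are available.

```r
library(rvest)

# Hypothetical two-table page; only the second table has the "wikitable" class.
page <- minimal_html('
  <table class="ombox"><tr><td>Notice</td></tr></table>
  <table class="wikitable">
    <tr><th>Country or region</th><th>Tested</th></tr>
    <tr><td>Afghanistan</td><td>154,767</td></tr>
    <tr><td>Albania</td><td>428,654</td></tr>
  </table>')

# html_element() returns the first node that matches the CSS selector.
wikitable <- html_element(page, "table.wikitable")

# html_table() parses that node; numbers with commas stay as character strings.
toy_table <- as.data.frame(html_table(wikitable))
print(toy_table)
```

On the real page two tables carry the `wikitable` class, so the first match would be the same node as `table[2]` in the listing above.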

Task 3: Pre-process and export the extracted data frame

summary(data_covid)

## Country.or.region Date.a. Tested Units.b.
## Length:173 Length:173 Length:173 Length:173
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Confirmed.cases. Confirmed..tested.. Tested..population..
## Length:173 Length:173 Length:173
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Confirmed..population.. Ref.
## Length:173 Length:173
## Class :character Class :character
## Mode :character Mode :character

preprocess_covid_data <- function(data_frame) {
  # Remove the "World" row
  data_frame <- data_frame[!(data_frame$`Country.or.region` == "World"), ]

  # Remove the last row (keep rows 1 to 172; the raw table above has 173 rows)
  data_frame <- data_frame[1:172, ]

  # Remove unnecessary columns
  data_frame["Ref."] <- NULL
  data_frame["Units.b."] <- NULL

  # Rename the columns
  names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio",
                         "tested.population.ratio", "confirmed.population.ratio")

  # Convert column data types (strip thousands separators before converting to numeric)
  data_frame$country <- as.factor(data_frame$country)
  data_frame$date <- as.factor(data_frame$date)
  data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
  data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
  data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
  data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
  data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

  return(data_frame)
}

proper_data_covid<- preprocess_covid_data(data_covid)
head(proper_data_covid)

## country date tested confirmed confirmed.tested.ratio
## 1 Afghanistan 17 Dec 2020 154767 49621 32.1
## 2 Albania 18 Feb 2021 428654 96838 22.6
## 3 Algeria 2 Nov 2020 230553 58574 25.4
## 4 Andorra 23 Feb 2022 300307 37958 12.6
## 5 Angola 2 Feb 2021 399228 20981 5.3
## 6 Antigua and Barbuda 6 Mar 2021 15268 832 5.4
## tested.population.ratio confirmed.population.ratio
## 1 0.40 0.130
## 2 15.00 3.400
## 3 0.53 0.130
## 4 387.00 49.000
## 5 1.30 0.067
## 6 15.90 0.860
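The type conversion repeats the same `gsub()` plus `as.numeric()` pattern for each numeric column. A minimal sketch of how that step could be factored out, using only base R; `to_number` is a hypothetical helper name, and the toy values are taken from the first two rows shown above.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric.
to_number <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

raw <- data.frame(
  tested = c("154,767", "428,654"),
  confirmed = c("49,621", "96,838"),
  stringsAsFactors = FALSE
)

# Apply the same conversion to every listed column in one step.
numeric_cols <- c("tested", "confirmed")
raw[numeric_cols] <- lapply(raw[numeric_cols], to_number)
str(raw)
```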

summary(proper_data_covid)

## country date tested
## Afghanistan : 1 2 Feb 2023 : 6 Min. : 3880
## Albania : 1 1 Feb 2023 : 4 1st Qu.: 512037
## Algeria : 1 31 Jan 2023: 4 Median : 3029859
## Andorra : 1 1 Mar 2021 : 3 Mean : 31377219
## Angola : 1 23 Jul 2021: 3 3rd Qu.: 12386725
## Antigua and Barbuda: 1 29 Jan 2023: 3 Max. :929349291
## (Other) :166 (Other) :149
## confirmed confirmed.tested.ratio tested.population.ratio
## Min. : 0 Min. : 0.00 Min. : 0.006
## 1st Qu.: 37839 1st Qu.: 5.00 1st Qu.: 9.475
## Median : 281196 Median :10.05 Median : 46.950
## Mean : 2508340 Mean :11.25 Mean : 175.504
## 3rd Qu.: 1278105 3rd Qu.:15.25 3rd Qu.: 156.500
## Max. :90749469 Max. :46.80 Max. :3223.000
##
## confirmed.population.ratio
## Min. : 0.000
## 1st Qu.: 0.425
## Median : 6.100
## Mean :12.769
## 3rd Qu.:16.250
## Max. :74.400
##

write.csv(proper_data_covid,file='covid-19(2023).csv',row.names=FALSE)

# Get the working directory
wd <- getwd()
# Build the path to check (note: "covid.csv" is not the name used in write.csv above)
file_path <- paste(wd, sep="", "/covid.csv")
# Print the file path
print(file_path)

## [1] "C:/Users/LENOVO/OneDrive/Documents/Coursera/IBM/covid.csv"

file.exists(file_path)

## [1] FALSE

# My saved file with new name
file_path <- paste(wd, sep="", "/covid-19(2023).csv")
print(file_path)

## [1] "C:/Users/LENOVO/OneDrive/Documents/Coursera/IBM/covid-19(2023).csv"

file.exists(file_path)

## [1] TRUE
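The first check returned FALSE only because the name being checked differed from the name passed to `write.csv()`. One way to avoid that kind of mismatch is to store the path once and reuse it. The sketch below is an illustration under assumptions: it writes a two-row toy data frame (values taken from the tables in this report) to a temporary directory rather than the project folder.

```r
# Toy data frame standing in for the processed data.
toy <- data.frame(country = c("Angola", "Armenia"), confirmed = c(20981, 422963))

# file.path() joins path components with "/" on every platform.
out_path <- file.path(tempdir(), "covid-19(2023).csv")
write.csv(toy, file = out_path, row.names = FALSE)

# Checking the same variable that was passed to write.csv() rules out a name mismatch.
exists_now <- file.exists(out_path)
print(exists_now)

# Read the file back to confirm the round trip.
round_trip <- read.csv(out_path)
print(round_trip)
```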

Task 4: Get a subset of the extracted data frame

# Read the processed data back from the CSV file
covid_data <- read.csv("covid-19(2023).csv")
# Get the 5th to 10th rows of the "country" and "confirmed" columns
covid_data[5:10,c('country','confirmed')]

## country confirmed
## 5 Angola 20981
## 6 Antigua and Barbuda 832
## 7 Argentina 9060495
## 8 Armenia 422963
## 9 Australia 10112229
## 10 Austria 5789991
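Subsetting with `[rows, columns]` accepts positions, names, or logical conditions. A short sketch of the row and column forms on a toy data frame built from the first six rows shown earlier; the 300,000 cutoff is an arbitrary value chosen for illustration.

```r
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  tested = c(154767, 428654, 230553, 300307, 399228, 15268),
  confirmed = c(49621, 96838, 58574, 37958, 20981, 832)
)

# Rows by position, columns by name.
by_position <- toy[5:6, c("country", "confirmed")]
print(by_position)

# Rows by condition: countries with more than 300,000 tests (arbitrary cutoff).
by_condition <- toy[toy$tested > 300000, c("country", "tested")]
print(by_condition)
```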

Task 5: Calculate worldwide COVID testing positive ratio

# Get the total confirmed cases worldwide


tot_confirmed <- sum(covid_data[,'confirmed'])
tot_confirmed

## [1] 431434555

# Get the total number of tests worldwide
tot_tested <- sum(covid_data[,'tested'])
tot_tested

## [1] 5396881644

# Get the positive ratio (confirmed / tested)


positive_ratio <- tot_confirmed/tot_tested
round(positive_ratio,2)

## [1] 0.08
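The same calculation can be written out with the two totals printed above, which makes the rounding step visible. This is a sketch only: the `with_gap` vector is a made-up example showing why `na.rm = TRUE` may be needed if a column ever contains missing values.

```r
# Totals as printed above.
tot_confirmed <- 431434555
tot_tested <- 5396881644

positive_ratio <- tot_confirmed / tot_tested

# Express the ratio as a percentage with two decimals.
positive_pct <- sprintf("%.2f%%", 100 * positive_ratio)
print(positive_pct)

# Made-up vector with a gap: sum() returns NA unless na.rm = TRUE is set.
with_gap <- c(49621, NA, 58574)
total_plain <- sum(with_gap)
total_clean <- sum(with_gap, na.rm = TRUE)
print(c(total_plain, total_clean))
```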

Task 6: Get a sorted name list of countries that reported their testing data

covid_data$country <- as.character(covid_data$country)

# Sort A to Z
sort(covid_data$country)

## [1] "Afghanistan" "Albania" "Algeria"
## [4] "Andorra" "Angola" "Antigua and Barbuda"
## [7] "Argentina" "Armenia" "Australia"
## [10] "Austria" "Azerbaijan" "Bahamas"
## [13] "Bahrain" "Bangladesh" "Barbados"
## [16] "Belarus" "Belgium" "Belize"
## [19] "Benin" "Bhutan" "Bolivia"
## [22] "Bosnia and Herzegovina" "Botswana" "Brazil"
## [25] "Brunei" "Bulgaria" "Burkina Faso"
## [28] "Burundi" "Cambodia" "Cameroon"
## [31] "Canada" "Chad" "Chile"
## [34] "China[c]" "Colombia" "Costa Rica"
## [37] "Croatia" "Cuba" "Cyprus[d]"
## [40] "Czechia" "Denmark[e]" "Djibouti"
## [43] "Dominica" "Dominican Republic" "DR Congo"
## [46] "Ecuador" "Egypt" "El Salvador"
## [49] "Equatorial Guinea" "Estonia" "Eswatini"
## [52] "Ethiopia" "Faroe Islands" "Fiji"
## [55] "Finland" "France[f][g]" "Gabon"
## [58] "Gambia" "Georgia[h]" "Germany"
## [61] "Ghana" "Greece" "Greenland"
## [64] "Grenada" "Guatemala" "Guinea"
## [67] "Guinea-Bissau" "Guyana" "Haiti"
## [70] "Honduras" "Hungary" "Iceland"
## [73] "India" "Indonesia" "Iran"
## [76] "Iraq" "Ireland" "Israel"
## [79] "Italy" "Ivory Coast" "Jamaica"
## [82] "Japan" "Jordan" "Kazakhstan"
## [85] "Kenya" "Kosovo" "Kuwait"
## [88] "Kyrgyzstan" "Laos" "Latvia"
## [91] "Lebanon" "Lesotho" "Liberia"
## [94] "Libya" "Lithuania" "Luxembourg[i]"
## [97] "Madagascar" "Malawi" "Malaysia"
## [100] "Maldives" "Mali" "Malta"
## [103] "Mauritania" "Mauritius" "Mexico"
## [106] "Moldova[j]" "Mongolia" "Montenegro"
## [109] "Morocco" "Mozambique" "Myanmar"
## [112] "Namibia" "Nepal" "Netherlands"
## [115] "New Caledonia" "New Zealand" "Niger"
## [118] "Nigeria" "North Korea" "North Macedonia"
## [121] "Northern Cyprus[k]" "Norway" "Oman"
## [124] "Pakistan" "Palestine" "Panama"
## [127] "Papua New Guinea" "Paraguay" "Peru"
## [130] "Philippines" "Poland" "Portugal"
## [133] "Qatar" "Romania" "Russia"
## [136] "Rwanda" "Saint Kitts and Nevis" "Saint Lucia"
## [139] "Saint Vincent" "San Marino" "Saudi Arabia"
## [142] "Senegal" "Serbia" "Singapore"

## [145] "Slovakia" "Slovenia" "South Africa"
## [148] "South Korea" "South Sudan" "Spain"
## [151] "Sri Lanka" "Sudan" "Sweden"
## [154] "Switzerland[l]" "Taiwan[m]" "Tanzania"
## [157] "Thailand" "Togo" "Trinidad and Tobago"
## [160] "Tunisia" "Turkey" "Uganda"
## [163] "Ukraine" "United Arab Emirates" "United Kingdom"
## [166] "United States" "Uruguay" "Uzbekistan"
## [169] "Venezuela" "Vietnam" "Zambia"
## [172] "Zimbabwe"

# Sort Z to A
ztoa_country <- sort(covid_data$country, decreasing = TRUE)
print(ztoa_country)

## [1] "Zimbabwe" "Zambia" "Vietnam"
## [4] "Venezuela" "Uzbekistan" "Uruguay"
## [7] "United States" "United Kingdom" "United Arab Emirates"
## [10] "Ukraine" "Uganda" "Turkey"
## [13] "Tunisia" "Trinidad and Tobago" "Togo"
## [16] "Thailand" "Tanzania" "Taiwan[m]"
## [19] "Switzerland[l]" "Sweden" "Sudan"
## [22] "Sri Lanka" "Spain" "South Sudan"
## [25] "South Korea" "South Africa" "Slovenia"
## [28] "Slovakia" "Singapore" "Serbia"
## [31] "Senegal" "Saudi Arabia" "San Marino"
## [34] "Saint Vincent" "Saint Lucia" "Saint Kitts and Nevis"
## [37] "Rwanda" "Russia" "Romania"
## [40] "Qatar" "Portugal" "Poland"
## [43] "Philippines" "Peru" "Paraguay"
## [46] "Papua New Guinea" "Panama" "Palestine"
## [49] "Pakistan" "Oman" "Norway"
## [52] "Northern Cyprus[k]" "North Macedonia" "North Korea"
## [55] "Nigeria" "Niger" "New Zealand"
## [58] "New Caledonia" "Netherlands" "Nepal"
## [61] "Namibia" "Myanmar" "Mozambique"
## [64] "Morocco" "Montenegro" "Mongolia"
## [67] "Moldova[j]" "Mexico" "Mauritius"
## [70] "Mauritania" "Malta" "Mali"
## [73] "Maldives" "Malaysia" "Malawi"
## [76] "Madagascar" "Luxembourg[i]" "Lithuania"
## [79] "Libya" "Liberia" "Lesotho"
## [82] "Lebanon" "Latvia" "Laos"
## [85] "Kyrgyzstan" "Kuwait" "Kosovo"
## [88] "Kenya" "Kazakhstan" "Jordan"
## [91] "Japan" "Jamaica" "Ivory Coast"
## [94] "Italy" "Israel" "Ireland"
## [97] "Iraq" "Iran" "Indonesia"
## [100] "India" "Iceland" "Hungary"
## [103] "Honduras" "Haiti" "Guyana"
## [106] "Guinea-Bissau" "Guinea" "Guatemala"
## [109] "Grenada" "Greenland" "Greece"
## [112] "Ghana" "Germany" "Georgia[h]"
## [115] "Gambia" "Gabon" "France[f][g]"

## [118] "Finland" "Fiji" "Faroe Islands"
## [121] "Ethiopia" "Eswatini" "Estonia"
## [124] "Equatorial Guinea" "El Salvador" "Egypt"
## [127] "Ecuador" "DR Congo" "Dominican Republic"
## [130] "Dominica" "Djibouti" "Denmark[e]"
## [133] "Czechia" "Cyprus[d]" "Cuba"
## [136] "Croatia" "Costa Rica" "Colombia"
## [139] "China[c]" "Chile" "Chad"
## [142] "Canada" "Cameroon" "Cambodia"
## [145] "Burundi" "Burkina Faso" "Bulgaria"
## [148] "Brunei" "Brazil" "Botswana"
## [151] "Bosnia and Herzegovina" "Bolivia" "Bhutan"
## [154] "Benin" "Belize" "Belgium"
## [157] "Belarus" "Barbados" "Bangladesh"
## [160] "Bahrain" "Bahamas" "Azerbaijan"
## [163] "Austria" "Australia" "Armenia"
## [166] "Argentina" "Antigua and Barbuda" "Angola"
## [169] "Andorra" "Algeria" "Albania"
## [172] "Afghanistan"
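Several names in the sorted lists still carry Wikipedia footnote markers such as `[c]` or `[f][g]`. The original preprocessing does not remove them; the sketch below shows one possible way to do so with `gsub()`, on a small sample of names taken from the output above.

```r
countries <- c("China[c]", "France[f][g]", "Germany", "Northern Cyprus[k]")

# Remove every bracketed single-letter marker such as "[c]" or "[f][g]".
clean_names <- gsub("\\[[a-z]\\]", "", countries)
print(sort(clean_names))
```

Without this step, an exact match such as `country == "France"` would return no rows, because the stored value is `"France[f][g]"`.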

Task 7: Identify country names with a specific pattern

# Find country names that contain a space (i.e., countries with multiple words in their name)
space_matches <- grep(" ", covid_data$country, value = TRUE)
print(space_matches)

## [1] "Antigua and Barbuda" "Bosnia and Herzegovina" "Burkina Faso"
## [4] "Costa Rica" "Dominican Republic" "DR Congo"
## [7] "El Salvador" "Equatorial Guinea" "Faroe Islands"
## [10] "Ivory Coast" "New Caledonia" "New Zealand"
## [13] "North Korea" "North Macedonia" "Northern Cyprus[k]"
## [16] "Papua New Guinea" "Saint Kitts and Nevis" "Saint Lucia"
## [19] "Saint Vincent" "San Marino" "Saudi Arabia"
## [22] "South Africa" "South Korea" "South Sudan"
## [25] "Sri Lanka" "Trinidad and Tobago" "United Arab Emirates"
## [28] "United Kingdom" "United States"
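The same `grep()` call works with other patterns. A brief sketch with anchored regular expressions on a hand-picked sample of names from the list above; the patterns `^United` and `land$` are illustrative choices, not part of the original task.

```r
countries <- c("United States", "United Kingdom", "South Korea",
               "Saint Lucia", "Iceland", "New Zealand")

# "^" anchors the pattern at the start of the name.
starts_united <- grep("^United", countries, value = TRUE)

# "$" anchors the pattern at the end of the name.
ends_land <- grep("land$", countries, value = TRUE)

# grepl() returns a logical vector instead of the matching values.
has_space <- grepl(" ", countries, fixed = TRUE)

print(starts_united)
print(ends_land)
print(has_space)
```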

Task 8: Pick two countries you are interested in, and then review their testing data

india <- covid_data[covid_data$country == "India",
                    c("country", "tested", "confirmed", "confirmed.population.ratio")]
germany <- covid_data[covid_data$country == "Germany",
                      c("country", "tested", "confirmed", "confirmed.population.ratio")]
india

## country tested confirmed confirmed.population.ratio
## 73 India 866177937 43585554 31.7

germany

## country tested confirmed confirmed.population.ratio
## 60 Germany 65247345 3733519 4.5

Task 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

# Use if-else statement


if (germany$confirmed.population.ratio > india$confirmed.population.ratio) {
print("Germany has a higher COVID-19 infection rate per population than India.")
} else {
print("India has a higher COVID-19 infection rate per population than Germany.")
}

## [1] "India has a higher COVID-19 infection rate per population than Germany."
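The comparison above is written for one fixed pair of countries. A hedged sketch of how it could be generalized to any two names follows; `higher_ratio_country` is a hypothetical helper, and the toy data frame only holds the two ratios printed in Task 8.

```r
toy <- data.frame(
  country = c("India", "Germany"),
  confirmed.population.ratio = c(31.7, 4.5)
)

# Hypothetical helper: return the name of the country with the larger ratio.
# Returns NA on a tie and stops if either name does not match exactly one row.
higher_ratio_country <- function(df, a, b) {
  ratio_a <- df$confirmed.population.ratio[df$country == a]
  ratio_b <- df$confirmed.population.ratio[df$country == b]
  stopifnot(length(ratio_a) == 1, length(ratio_b) == 1)
  if (ratio_a > ratio_b) a else if (ratio_b > ratio_a) b else NA_character_
}

winner <- higher_ratio_country(toy, "Germany", "India")
print(winner)
```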

Task 10: Find countries with a confirmed cases to population ratio less than a threshold

# Get a subset of any countries with `confirmed.population.ratio` less than the threshold
low_risk_countries <- covid_data[(covid_data$`confirmed.population.ratio` <1), ]
head(low_risk_countries)

## country date tested confirmed confirmed.tested.ratio
## 1 Afghanistan 17 Dec 2020 154767 49621 32.1
## 3 Algeria 2 Nov 2020 230553 58574 25.4
## 5 Angola 2 Feb 2021 399228 20981 5.3
## 6 Antigua and Barbuda 6 Mar 2021 15268 832 5.4
## 14 Bangladesh 24 Jul 2021 7417714 1151644 15.5
## 19 Benin 4 May 2021 595112 7884 1.3
## tested.population.ratio confirmed.population.ratio
## 1 0.40 0.130
## 3 0.53 0.130
## 5 1.30 0.067
## 6 15.90 0.860
## 14 4.50 0.700
## 19 5.10 0.067
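The threshold of 1 is hard-coded in the subset above. A minimal sketch of the same filter with the threshold as a parameter; `below_threshold` is a hypothetical helper, and the toy ratios are those of the first six countries shown earlier.

```r
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  confirmed.population.ratio = c(0.13, 3.4, 0.13, 49.0, 0.067, 0.86)
)

# Hypothetical helper: rows whose ratio is strictly below `threshold`.
# which() drops NA comparisons, so rows with a missing ratio are excluded
# instead of coming back as all-NA rows.
below_threshold <- function(df, threshold = 1) {
  df[which(df$confirmed.population.ratio < threshold), ]
}

low <- below_threshold(toy, threshold = 1)
print(low$country)
```

With the threshold at 1, the toy result matches the first four rows of the output above (Afghanistan, Algeria, Angola, Antigua and Barbuda).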
